Prediction of the elastic properties and electrical resistance of halide glass based on interpretable machine learning

Jiayang Zhou a and Xiangfu Wang *ab
aCollege of Electronic and Optical Engineering & College of Flexible Electronics (Future Technology), Nanjing University of Posts and Telecommunications, Nanjing 210023, China. E-mail: xfwang@njupt.edu.cn
bThe Key Laboratory of Radio and Micro-Nano Electronics of Jiangsu Province, Nanjing 210023, China

Received 31st August 2025, Accepted 7th November 2025

First published on 8th November 2025


Abstract

Halide glass is indispensable for high-end optoelectronic devices because of its unique photoelectric properties. Its elastic modulus and electrical resistance directly determine the mechanical reliability and signal integrity of these devices. However, conventional experimental fabrication and testing methods cannot establish an accurate structure–property relationship: halide glass is highly susceptible to moisture and crystallization, and microcracks, porosity, and surface roughness markedly degrade its optoelectronic performance. To overcome these challenges, this study builds interpretable machine-learning models that rely solely on chemical composition and elemental physicochemical descriptors. The experimental data were preprocessed using the sine cosine algorithm (SCA) for feature selection and generative adversarial networks (GANs) for data augmentation, establishing a dataset for predicting the elastic properties and electrical resistance of halide glasses. The study evaluated the performance of six traditional machine learning algorithms and four deep learning and neural network algorithms across different task dimensions, achieving good predictive results. Random forest achieved the best performance for the prediction of Young's modulus (R2 = 0.96146), support vector machine excelled for the prediction of shear modulus (R2 = 0.95129), and decision tree led for the prediction of Poisson's ratio (R2 = 0.96783). The ensemble learning algorithms (LSBoost and XGBoost) performed well (R2 > 0.9) for the prediction of resistivity at different temperatures, while the BP neural network achieved good results across all six tasks (R2 > 0.83). The proposed composition-only design strategy offers direct guidance for developing new halide glasses and for computer-aided inverse design.


1 Introduction

Halide glasses are a class of amorphous solid materials. They are formed through covalent–ionic hybrid bonds between halogens (such as fluorine, chlorine, bromine, or iodine) and metal cations (e.g., Al, Zr, In, Zn). Compared to traditional silicate glasses, halide glasses exhibit lower phonon energy,1,2 a broader infrared transmission range,3,4 a higher refractive index,5–7 and tunable dielectric constants.8–10 These properties grant halide glasses irreplaceable application value in high-end electronic and optoelectronic devices. Examples include infrared fiber amplifiers,11,12 laser gain media,13,14 ultrafast nonlinear optical devices,15,16 and low-loss infrared windows.17,18 In these devices, mechanical reliability and signal integrity depend directly on the elastic properties (Young's modulus, shear modulus, and Poisson's ratio) and electrical resistance of the glass. Fast and accurate prediction of these key properties is essential for device miniaturization, long service life, and high environmental stability.

Currently, measurement of the elastic properties (Young's modulus, shear modulus, Poisson's ratio) of halide glasses relies primarily on static mechanical testing methods such as nanoindentation,19,20 ultrasonic echo techniques,21 or Brillouin scattering spectroscopy.22–24 These methods are precise, yet they are time-consuming, demand rigorous sample preparation, and cannot span the entire compositional space of multi-component systems. Microcracks, porosity and surface roughness further disturb the results.25 Electrical resistance is usually measured by two-probe or four-probe techniques, which require repeated heating–cooling cycles in constant-temperature baths at points such as 20 °C, 100 °C and 150 °C.26 The procedure is lengthy and energy-intensive, and at elevated temperatures halide glasses readily deliquesce and crystallize, causing data drift and poor repeatability. A rapid and accurate method that simultaneously predicts elastic properties and temperature-dependent electrical resistance is therefore urgently needed.

With the advance of computational materials science, researchers have employed molecular dynamics (MD) simulations27,28 and density functional theory (DFT) calculations29,30 to estimate the dielectric properties of oxide glasses. To enhance the accuracy and reliability of predictions, these computational results can be combined with experimental measurements from databases; by comparing simulation and experimental data, researchers can validate their models and identify key areas that require further investigation. However, MD simulations require numerous parameters or approximations to accurately capture dipole moments in dielectric behavior, and DFT calculations, owing to their enormous computational cost, struggle to reproduce realistic glass structures. Machine learning (ML) has now been widely applied in fields such as medical science,31–37 environmental science,38–40 and materials science and engineering,41–48 and it provides new opportunities for modeling glass materials. To date, ML has successfully predicted various glass properties, including density,49 refractive index,50,51 glass transition temperature,52 and Vickers hardness.53,54 Mechanism-driven approaches have improved data availability and quality and enhanced model generalization and predictive accuracy by building cross-institutional shared databases,55–57 developing descriptor libraries specific to materials informatics, integrating physics-guided machine learning, and designing temperature-adaptive learning frameworks, thereby driving halide glass design from empirical trial-and-error toward mechanism-driven development. Nevertheless, realizing the full potential of ML for predicting the elastic properties and electrical resistance of halide glasses still faces two key challenges. The first is to construct a sufficiently comprehensive dataset for halide glasses. The second is to enable ML models to predict the properties of glasses composed of components not present in the training set.

To address these challenges, researchers have developed various descriptors. These descriptors are based on MD simulation parameters, structural information, or elemental properties. Among them, descriptors based on elemental properties are particularly promising.58,59 These descriptors convert the chemical composition of a glass into corresponding elemental physicochemical properties. They have demonstrated effectiveness in predicting temperature-dependent viscosity and electrical resistance.

This study successfully establishes a machine learning framework designed to predict the elastic properties and electrical resistance of halide glasses. We collected and preprocessed experimental data. Employing feature selection and data augmentation techniques, we constructed a high-quality dataset. It lays a solid foundation for model training. For algorithm selection, we systematically evaluated the performance of various ML algorithms. Evaluated traditional ML algorithms included decision trees, random forests, XGBoost, LSBoost, support vector machines (SVM), and Gaussian kernel regression (GKR). Evaluated deep learning algorithms included backpropagation networks (BP), convolutional neural networks (CNN), long short-term memory networks (LSTM), and generalized regression neural networks (GRNN). We found that different algorithms exhibit varying performance across prediction tasks. We identified the best-performing algorithms for different task dimensions. Subsequently, we employed SHAP analysis to investigate the contribution of each input feature to the predictions. This analysis revealed key factors influencing the properties of halide glasses. This aids in the design of glass compositions with targeted elastic properties and electrical resistance. This study demonstrates the capability of machine learning for predicting the elastic properties and temperature-dependent electrical resistance of halide glasses.

2 Method

2.1 Data collection and data preprocessing

Data collection and dataset construction are crucial in the development of machine learning models, as they directly determine the accuracy and generalization ability of the models. In this study, we developed a feature selection strategy by conducting a systematic literature review and collecting experimental data related to halide glass composition, elastic properties (Young's modulus, shear modulus, Poisson's ratio), and electrical resistance (at 20 °C, 100 °C, and 150 °C) from academic databases, including Scopus, Web of Science, ScienceDirect, and IEEE Xplore. In the end, we collected a total of 60 data entries60–77 to form the original dataset, which is listed in Table S1 in the SI. In terms of elastic properties, Young's modulus and shear modulus are relatively concentrated in their distributions, with a coefficient of variation of approximately 0.2, indicating good consistency in the elastic characteristics of these materials; their average values are 53.2 GPa and 20.7 GPa, respectively. In contrast, the Poisson's ratio data exhibit greater variability and include 15 outliers, suggesting significant differences in the lateral deformation capabilities of the various materials. Regarding electrical properties, the electrical resistance at different temperatures (on a logarithmic scale) is very dispersed, with a coefficient of variation greater than 0.6, spanning a wide range from −8.11 to 14.18 at 20 °C. This indicates that the materials cover a broad spectrum from conductors to insulators. Moreover, the resistivity decreases noticeably with increasing temperature, the average value dropping from 8.21 at 20 °C to 6.03 at 150 °C, a trend typical of semiconductors or insulators. Data management involved four steps: screening the literature to select studies that provided the necessary data; extracting the relevant data and cross-verifying it for accuracy; cleaning the data to address missing values and outliers, ensuring the integrity of the dataset; and organizing the cleaned data into a structured original dataset, presented in tabular form in the SI for access and verification by other researchers.

In the original data, the glass composition was represented by the molar fractions of its components, using descriptors in the chemical composition domain. Although these descriptors are intuitive and easily accessible, machine learning models using only these descriptors cannot predict the elastic properties and electrical resistance of glasses with compositions not present in the original training set. To overcome this limitation and predict the properties of glasses with unknown compositions, we transformed the descriptors from the chemical composition domain to the domain of elemental physical and chemical properties by extracting stoichiometric features, element property-based features, valence orbital occupancy features, and ionicity features. These atomic-level descriptors, by revealing the fundamental mechanisms of material microstructure, can more finely influence and predict the macroscopic properties of materials. For instance, elements with smaller atomic radii typically form stronger covalent bonds, which may result in a higher Young's modulus, as strong covalent bonds can resist external stress without deformation. Elements with higher electronegativity tend to attract more electron density, which can affect the material's polarization characteristics and dielectric constant. In this way, atomic-level descriptors provide a means to predict how characteristics at the atomic and molecular levels influence the macroscopic physical and chemical properties of materials, thereby enhancing the prediction of elastic and electrical properties.

Stoichiometric features78 reflect the relative proportions of elements, influencing glass structure and bonding. Element property-based features, like atomic radius and electronegativity, affect bond strength and nature, impacting elasticity and electrical resistance. Valence orbital occupancy features indicate electronic configurations, crucial for bonding and charge transport. Ionicity features measure bond ionic character, vital for electrical properties in halide glasses. Together, these features enhance the prediction of elastic and electrical properties. Stoichiometric features describe the molar fractions of elements in the glass composition and are based on the Lp norms (such as L1 and L2 norms) of the vectors representing the molar fractions of each element in the glass. These features help machine learning models understand the impact of different element ratios on the properties of the halide glass, and their calculation formula is:

 
$\|x\|_p = \left( \sum_{i=1}^{n} x_i^{\,p} \right)^{1/p}$ (1)

In this formula, $n$ represents the number of elements in the glass and $x_i$ is the molar fraction of element $i$. We selected p = 2, 3, 5, 7 as the norm parameters; the criterion for selecting these norms was that the relative difference between $\|x\|_p$ and $\|x\|_{p+1}$ should exceed 1%.
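As an illustration, the stoichiometric features of eqn (1) can be computed in a few lines of Python; the molar fractions below describe a hypothetical three-component composition, not a specific glass from the dataset.

```python
import numpy as np

def lp_norms(mol_fractions, ps=(2, 3, 5, 7)):
    """Lp norms of a molar-fraction vector, as in eqn (1)."""
    x = np.asarray(mol_fractions, dtype=float)
    return {f"L{p}_norm": float((x ** p).sum() ** (1.0 / p)) for p in ps}

# Hypothetical composition with molar fractions summing to 1
print(lp_norms([0.55, 0.35, 0.10]))
```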

Element property features can be obtained through the mendeleev (https://github.com/lmmentel/mendeleev) and matminer79 modules in Python. For each elemental property (denoted $f_i$), we calculated its composition-weighted average (denoted $\hat{f}$) and average deviation (denoted $\delta f$), according to the following formulas:

 
$\hat{f} = \sum_{i=1}^{n} x_i f_i$ (2)

$\delta f = \sum_{i=1}^{n} x_i \left| f_i - \hat{f} \right|$ (3)
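A minimal sketch of eqns (2) and (3) is shown below, assuming the mendeleev package's element() accessor for elemental properties; the per-element fractions are illustrative.

```python
import numpy as np
from mendeleev import element  # pip install mendeleev

def property_features(symbols, fractions, prop="atomic_weight"):
    """Weighted average (eqn (2)) and average deviation (eqn (3))
    of one elemental property over a composition."""
    x = np.asarray(fractions, dtype=float)
    f = np.array([getattr(element(s), prop) for s in symbols], dtype=float)
    f_hat = (x * f).sum()                  # eqn (2)
    f_dev = (x * np.abs(f - f_hat)).sum()  # eqn (3)
    return f_hat, f_dev

# Hypothetical ZrF4-BaF2 glass expressed per element
print(property_features(["Zr", "Ba", "F"], [0.11, 0.07, 0.82]))
```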

Valence orbital occupancy features80 refer to the weighted average of electron counts in each valence orbital within a specific compound; this feature characterizes the electronic structure properties of the elements. The ionic character feature describes the degree of ionic nature of the chemical bonds between atoms in a compound. It is calculated from the electronegativity difference between elements and is used to predict certain physical and chemical properties of materials. The ionic character between a constituent element and oxygen is given by eqn (4), where Xi and Xo represent the electronegativity of the constituent element and of oxygen, respectively.

 
$I(X_i, X_o) = \exp\left(-0.25\,(X_i - X_o)^2\right)$ (4)

The overall ionic character of the halide glass is given by eqn (5):

 
$\bar{I} = \sum_{i=1}^{n} x_i\, I(X_i, X_o)$ (5)
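The ionicity descriptors can be sketched as follows. The composition-weighted sum used for eqn (5) is one plausible reading of the overall ionic character, and the electronegativity values and fractions are illustrative.

```python
import numpy as np

def ionic_character(x_i, x_ref):
    """Ionic character from the electronegativity difference, eqn (4)."""
    return np.exp(-0.25 * (x_i - x_ref) ** 2)

def overall_ionicity(fractions, chis, x_ref):
    """Composition-weighted ionic character (assumed form of eqn (5))."""
    x = np.asarray(fractions, dtype=float)
    chi = np.asarray(chis, dtype=float)
    return float((x * ionic_character(chi, x_ref)).sum())

# Two hypothetical cations, weighted against oxygen (Pauling electronegativity 3.44)
print(overall_ionicity([0.6, 0.4], [1.33, 0.89], x_ref=3.44))
```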

After establishing the initial dataset, this study constructed a database suitable for machine learning through the following steps: first, feature standardization was applied to eliminate dimensional differences; second, data augmentation techniques were used to enhance sample diversity. Considering the skewed distribution of the data, Z-score normalization81 was adopted after removing extreme outliers. Compared to other normalization methods, Z-score normalization has significant advantages. Firstly, it does not rely on any assumptions about the data distribution shape, directly converting the mean of the data to 0 and the standard deviation to 1. Secondly, Z-score normalization preserves the original distribution characteristics of the data to the greatest extent, avoiding data distortion due to uneven or skewed data distribution. This method provides standardized and comparable data for subsequent analysis, thereby effectively improving the accuracy and reliability of the analysis results. Its core formula is defined as:

 
$z = \dfrac{x - \mu}{\sigma}$ (6)

In this formula, x is the original data point, μ is the mean of the data set, and σ is the standard deviation of the data set.
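A minimal numpy sketch of eqn (6), applied column-wise to a small illustrative feature matrix:

```python
import numpy as np

def z_score(X):
    """Column-wise Z-score normalization, eqn (6)."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[53.2, 20.7], [60.1, 24.3], [47.8, 18.9]])
print(z_score(X))  # each column now has mean 0 and standard deviation 1
```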

The limited amount of original data may pose potential risks to model training, such as overfitting, which undermines the generalization ability of the model on unseen data. Moreover, the lack of diversity and scale in the data may restrict the model's robustness to noise and interference. To address these risks, strategies such as data augmentation (to increase dataset diversity), regularization (to reduce model complexity), and cross-validation (to assess generalization capability) can be integrated. To enlarge and diversify the dataset, this study employs generative adversarial networks (GANs)82 for data augmentation, with the aim of improving the model's generalization ability, adaptability to unseen data, and robustness to noise and interference. GANs consist of two core components: the generator and the discriminator. The generator is responsible for producing synthetic data that follows the distribution of the real data, while the discriminator is tasked with distinguishing between real and generated data. During training, the generator and discriminator improve each other through competitive learning: the generator progressively increases the realism of the generated data, while the discriminator enhances its recognition capabilities. The system reaches equilibrium when the generator's output is indistinguishable from genuine data. Unlike conventional data augmentation techniques that involve simple transformations such as rotation and scaling, this generative method produces entirely new data samples, significantly expanding the size and diversity of the dataset. Moreover, data generated by GANs more accurately reflect the characteristics of real-world data distributions. The objective function of GANs can be expressed as follows:

 
$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$ (7)
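In eqn (7), G is the generator, D the discriminator, p_data the real-data distribution, and p_z the latent prior. The following is a minimal PyTorch sketch of the corresponding training loop for tabular data; the network widths, learning rates, and the random stand-in for the real dataset are illustrative, not the configuration used in this study.

```python
import torch
import torch.nn as nn

N_FEATURES, LATENT_DIM = 18, 16  # 18 selected features; latent size is illustrative

generator = nn.Sequential(nn.Linear(LATENT_DIM, 64), nn.ReLU(),
                          nn.Linear(64, N_FEATURES))
discriminator = nn.Sequential(nn.Linear(N_FEATURES, 64), nn.LeakyReLU(0.2),
                              nn.Linear(64, 1), nn.Sigmoid())
bce = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

real = torch.randn(60, N_FEATURES)  # stand-in for the standardized real dataset
ones, zeros = torch.ones(60, 1), torch.zeros(60, 1)

for epoch in range(1000):
    # Discriminator step: push D(real) toward 1 and D(G(z)) toward 0
    fake = generator(torch.randn(60, LATENT_DIM)).detach()
    loss_d = bce(discriminator(real), ones) + bce(discriminator(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: push D(G(z)) toward 1, the min-max game of eqn (7)
    loss_g = bce(discriminator(generator(torch.randn(60, LATENT_DIM))), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

synthetic = generator(torch.randn(120, LATENT_DIM)).detach()  # augmented samples
```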

To enhance algorithm performance and reduce dependence on computational power, we employed the sine cosine algorithm83 (SCA) for preliminary feature analysis and selection. Ultimately, eighteen input features with a significant impact on model performance were selected; they are listed in Table 1, and a sketch of the SCA selection loop follows the table. Feature selection extracts the most representative and relevant feature subset from the raw data, reducing data dimensionality, minimizing redundant information and noise, and enhancing the generalization capability of the model.

Table 1 Features selected in this study
Ionization_energy1_avg Ionization_energy1_dev
Ionization_energy2_avg Ionization_energy2_dev
Ionization_energy3_avg Ionization_energy3_dev
Atomic_weight_avg Atomic_weight_dev
Atomic_volume_avg Atomic_volume_dev
Dipole_polarizability_avg Dipole_polarizability_dev
Melting_point_avg Melting_point_dev
Boiling_point_avg Boiling_point_dev
Heat_of_formation_avg Heat_of_formation_dev
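To make the selection step concrete, a compact sketch of SCA-style wrapper feature selection is given below; the binary thresholding, agent count, iteration budget, and the random-forest fitness function are illustrative choices rather than the exact implementation used here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Mean cross-validated R2 of a small model on the selected subset."""
    if mask.sum() == 0:
        return -np.inf
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    return cross_val_score(model, X[:, mask], y, cv=3, scoring="r2").mean()

def sca_select(X, y, n_agents=8, n_iter=20, a=2.0):
    n_feat = X.shape[1]
    pos = rng.random((n_agents, n_feat))       # continuous positions in [0, 1]
    best_pos, best_fit = pos[0].copy(), -np.inf
    for t in range(n_iter):
        for i in range(n_agents):
            fit = fitness(pos[i] > 0.5, X, y)  # threshold to a binary mask
            if fit > best_fit:
                best_fit, best_pos = fit, pos[i].copy()
        r1 = a - t * a / n_iter                # amplitude decays over iterations
        r2 = rng.random((n_agents, n_feat)) * 2 * np.pi
        r3 = rng.random((n_agents, n_feat)) * 2
        r4 = rng.random((n_agents, n_feat))
        step = np.where(r4 < 0.5, np.sin(r2), np.cos(r2))  # sine/cosine branches
        pos = np.clip(pos + r1 * step * np.abs(r3 * best_pos - pos), 0, 1)
    return best_pos > 0.5

X_demo, y_demo = rng.random((60, 30)), rng.random(60)
print(sca_select(X_demo, y_demo).sum(), "features selected")
```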


Rational partitioning of the dataset is a crucial step in model evaluation and has a direct impact on the generalization capability of the model. In this study, the dataset was partitioned into training, validation, and test datasets following an 8:1:1 ratio. The training dataset was used for learning the model parameters, the validation dataset for optimizing hyperparameters and selecting the model, and the test dataset was reserved for the final assessment of model performance, simulating the model's behavior on unseen data.
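For example, the 8:1:1 partition can be reproduced with two successive scikit-learn splits (placeholder data shown):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(60, 18), np.random.rand(60)  # placeholder data

# Hold out 20% first, then split that holdout evenly into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)
print(len(X_train), len(X_val), len(X_test))  # 48 6 6
```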

By implementing techniques such as descriptor transformation, data standardization, data augmentation, feature selection and data partitioning, a high-quality dataset was constructed. These methods significantly improved data consistency, usability, and model performance, providing a solid foundation for subsequent machine learning tasks.

2.2 The algorithm of machine learning prediction models

The selection of algorithms is crucial for model performance and accuracy. From a machine learning perspective, we aim to establish a mathematical relationship between the elastic properties and electrical resistance of halide glass and the input features. To this end, we evaluated six traditional machine learning algorithms, namely decision tree (DT),84 random forest (RF),85 least squares boosting (LSBoost),86 eXtreme gradient boosting (XGBoost),87 Gaussian kernel regression (GKR),88 and support vector machines (SVM),89 together with four deep learning and neural network algorithms: back-propagation (BP),90 long short-term memory networks (LSTM),91 convolutional neural networks (CNN),92 and generalized regression neural networks (GRNN).93 Choosing an algorithm suited to the characteristics of the dataset is essential for enhancing predictive accuracy and improving model performance in machine learning applications.

Decision trees are intuitive machine learning algorithms used for classification and regression tasks. They recursively select the best features to split the dataset, constructing a tree model where internal nodes represent feature tests, branches are test outcomes, and leaf nodes provide the final predictions. This algorithm is easy to understand and implement but may be prone to overfitting.

Random forest is an ensemble learning algorithm that improves prediction accuracy and stability by combining multiple decision trees. It generates multiple subsets by sampling with replacement from the original dataset, training a decision tree on each subset. Feature selection is randomized during tree construction to enhance diversity. The final prediction is the mean of all tree outputs, effectively preventing overfitting.

The concept of ensemble learning is exemplified by random forest, which enhances overall performance by integrating multiple models. XGBoost (eXtreme gradient boosting) further develops this concept by recursively optimizing the loss function through gradient boosting, making it particularly effective for large datasets. However, XGBoost may require extensive hyperparameter tuning for optimal performance.

Similar to XGBoost, least squares boosting (LSBoost) employs a gradient boosting strategy but uses least squares to handle residuals. This method may be more straightforward in some cases but might be less efficient than XGBoost. On the other hand, Gaussian kernel regression offers a non-parametric approach to data processing, ideal for capturing complex nonlinear relationships, although it may require more data for accurate predictions.

Support vector machine (SVM) adopts a distinct strategy by finding the optimal separating hyperplane in feature space for classification. For regression tasks, SVM can handle nonlinear issues through kernel tricks, making it particularly effective in small-sample learning but potentially challenging computationally with large datasets.

Gaussian kernel regression (GKR) employs a kernel trick. It maps input data into a high-dimensional space using the Gaussian kernel function to capture complex nonlinear relationships. This method is ideal for patterns that traditional linear models cannot capture. However, the performance of GKR depends heavily on kernel parameters like bandwidth, which need careful tuning. As a non-parametric method, GKR may require more data and has higher computational costs with larger datasets. Despite these challenges, GKR remains a valuable tool in machine learning, offering flexibility and adaptability for various applications.

It is worth noting that, due to the implementation principles of the above algorithms, they are often used for problems with multiple inputs and a single output. In this study, we adopted the method of transforming multi-input multi-output problems into multiple multi-input single-output problems to ensure the algorithms perform optimally. In contrast, deep learning and neural network algorithms can directly handle multi-input multi-output tasks, offering significant advantages in efficiency.
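In scikit-learn, this decomposition can be expressed with a per-target wrapper that fits one independent single-output model per property; the data shapes below are placeholders for the six prediction targets.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

X = np.random.rand(60, 18)  # 18 selected input features
Y = np.random.rand(60, 6)   # E, G, Poisson's ratio, and resistance at three temperatures

# One clone of the base regressor is trained per output column
model = MultiOutputRegressor(RandomForestRegressor(n_estimators=200, random_state=0))
model.fit(X, Y)
print(model.predict(X[:2]).shape)  # (2, 6)
```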

Back-propagation (BP) is the key training algorithm for neural networks. It updates weights by calculating the gradient of the loss function with respect to the network's weights, minimizing prediction errors. BP is the foundation for training most neural networks, including LSTM and CNN networks.

Long short-term memory (LSTM) is a special type of recurrent neural network (RNN) that addresses the vanishing gradient problem in traditional RNNs when dealing with long sequences by introducing gating mechanisms. LSTM is particularly suitable for processing time series data, such as stock price prediction and natural language processing tasks. LSTM uses the BP algorithm to train its complex gating structures, enabling it to effectively capture long-term dependencies.

Convolutional neural network (CNN) is another widely used type of neural network. CNN automatically extracts local features through convolutional layers and reduces the dimensionality of features with pooling layers. CNN excels in image recognition and classification tasks, such as face recognition and object detection. CNN also relies on the BP algorithm to train their convolutional and pooling layers, optimizing network performance.

Given the small sample size in this study, the generalized regression neural network (GRNN) proves advantageous. This probabilistic model-based neural network utilizes radial basis functions for nonlinear mapping of input data. GRNN predicts by averaging the local values around data points, capturing local patterns effectively. A notable feature of GRNN is its direct model construction from input data without iterative optimization, enhancing efficiency and reducing overfitting risks in small-sample scenarios. While GRNN excels in small-sample learning, it might not perform as well as more complex models on larger datasets. Thus, GRNN is also considered for this research due to its suitability for limited data.

2.3 SHAP analysis

In the field of machine learning, the interpretability of models is crucial. SHapley Additive exPlanations (SHAP)94 is a powerful tool for model interpretation; its theoretical foundation is the Shapley value from cooperative game theory, and it is now widely applied across various domains. The method estimates the impact of features by observing the model's behavior with and without specific features, transforming the attribution problem into a cooperative game. The SHAP value provides a unique and consistent local exact explanation of each feature's contribution to the final output in the absence of attributes (where missing features are assigned a value of zero). However, SHAP analysis has limitations when applied to complex models: for example, it assumes independence among features, which may not hold when features are highly correlated, affecting the accuracy of the interpretation. Despite these limitations, SHAP analysis can still provide an in-depth understanding of the model's predictive behavior. The SHAP value is calculated as follows:
 
$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \dfrac{|S|!\,(|N| - |S| - 1)!}{|N|!} \left[ \nu(S \cup \{i\}) - \nu(S) \right]$ (8)

In this formula, φi is the SHAP value of feature i, N is the set of all features, and S is any subset of features that does not include feature i. The notation |S| denotes the number of features in set S. The function ν(S) represents the contribution of feature set S to the model's predictive output, and ν(S ∪ {i}) represents the contribution of the feature set S ∪ {i}, which includes feature i.

In this study, we calculated the Shapley values for each feature using the Python SHAP package to explain the model outputs. Additionally, we used scatter plots, bar plots, heatmaps, and bee swarm plots to visualize the results of the SHAP analysis. Specifically, in the context of predicting the performance of halide glasses, these SHAP values provide a detailed breakdown of how features derived from the chemical composition and elemental physicochemical properties contribute to the predictions of properties like Young's modulus and electrical resistivity. By quantifying the impact of each feature with a SHAP value, we can pinpoint which elements or structural aspects are most influential in determining the material's performance under varying conditions. This detailed analysis not only enhances the interpretability of the model's predictions but also informs future material design and optimization by highlighting the critical factors affecting the properties of halide glasses. In the implementation, we used several libraries, including Scikit-learn, numpy, pandas, and matplotlib, to build and train our models.
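A minimal sketch of this workflow is shown below, assuming a tree-based model so that the package's TreeExplainer applies; the data are placeholders for the halide glass features.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(60, 18)  # placeholder feature matrix
y = np.random.rand(60)      # placeholder target, e.g. Young's modulus

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
explainer = shap.TreeExplainer(model)   # efficient exact SHAP for tree ensembles
shap_values = explainer.shap_values(X)  # one attribution per sample and feature

shap.summary_plot(shap_values, X)       # bee-swarm plot of feature impacts
```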

3 Results and discussion

This research established a framework utilizing machine learning regression to forecast the elasticity and electrical resistance of halide glasses. It employed an ensemble approach with ten unique regression models, and the best algorithm was selected based on predictive precision and model fit. Following prediction, the outcomes were combined to create visual analyses comparing experimental and forecasted results; these visual tools convert data into clear graphical forms, aiding the evaluation of model efficacy. Moreover, common regression metrics (e.g., R2, RMSE) and SHAP values were incorporated for a thorough understanding of model performance. In conclusion, the machine learning framework proposed here effectively appraised and contrasted various algorithms, showcasing strong analytical strength. Our study utilized high-performance computing resources, including an NVIDIA RTX 3070 Ti Laptop GPU and multi-core processors, to handle the datasets and model training processes. Owing to the limited amount of data, model training times were generally less than 60 minutes, allowing us to iterate quickly and optimize model parameters.

3.1 Analysis of data preprocessing

To evaluate the effectiveness of data preprocessing, a comparison was made between the data before and after augmentation. The focus was on assessing whether the diversity and representativeness of the data had been enhanced. To evaluate the data better, we generated a correlation matrix heatmap from normalized data (Fig. 1a). This heatmap visually demonstrates the relationships among variables. The intensity of colors in the heatmap reflects the strength of correlations: red indicates a strong positive correlation, blue signifies a strong negative correlation, and white denotes a weak correlation. Features showing strong correlations are marked with an asterisk. As illustrated in Fig. 1b, the histogram comparing the distribution of original and synthetic data reveals that the synthetic data closely mirrors the distribution pattern of the original data. It suggests that the data augmentation process did not introduce substantial changes to the core characteristics of the data, thus ensuring the effectiveness and reliability of the preprocessing step.
Fig. 1 (a) Heatmap of correlation matrix for normalized data of the halide glass. Features that show strong correlation are marked with an asterisk. (b) Histogram of original data and synthetic data after data augmentation.
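A sketch of how such diagnostics might be produced with pandas and matplotlib follows; the random data and the noise-based stand-in for GAN output are purely illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(60, 5), columns=[f"f{i}" for i in range(5)])

# (a) correlation matrix heatmap of the normalized features
corr = df.corr()
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(label="Pearson correlation")
plt.xticks(range(len(corr)), corr.columns, rotation=90)
plt.yticks(range(len(corr)), corr.columns)
plt.show()

# (b) overlaid histograms of one feature before and after augmentation
synthetic = df["f0"] + np.random.normal(0, 0.1, len(df))  # stand-in synthetic data
plt.hist(df["f0"], bins=15, alpha=0.6, label="original")
plt.hist(synthetic, bins=15, alpha=0.6, label="synthetic")
plt.legend(); plt.show()
```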

In order to further substantiate the efficacy of data augmentation, this study utilized robust mathematical techniques for analysis. Initially, the K-means clustering algorithm was employed to segment the original dataset into three distinct clusters (Fig. 2a). Comparing the sample counts before and after data augmentation shows that the augmentation predominantly increased the sample sizes of the less represented classes (class 2 and class 3). This observation demonstrates that the proposed data augmentation technique effectively boosts the representation of minority classes, leading to a more balanced data distribution. Such an enhancement holds substantial importance for improving model performance across all classes, especially those with limited sample sizes.


Fig. 2 (a) Bar chart showing the number of original samples versus synthetic samples across three classes. (b) 2D t-SNE visualization comparing the distribution of original data with synthetic data.

Additionally, the t-SNE technique was applied to map the high-dimensional dataset onto a two-dimensional plane (Fig. 2b), where each dot corresponds to a unique sample. Comparing the data before and after augmentation reveals that the original dataset displays a more scattered distribution of the three classes, with some classes having a low sample density. In contrast, the augmented dataset exhibits a more concentrated arrangement of points, with clearer class distinctions. This suggests that data augmentation has successfully enriched the diversity and coverage of the dataset. Consequently, the model is better equipped to capture the characteristics of the various classes during training, and can discern a broader spectrum of feature interactions and attribute alterations, which helps to mitigate overfitting. Such enhancements bolster the model's resilience and capacity to generalize, facilitating its adjustment to intricate, real-world situations.
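Both checks can be sketched with scikit-learn; the cluster count follows the text, while the placeholder data and the t-SNE perplexity are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

original = np.random.rand(60, 18)
synthetic = np.random.rand(60, 18)  # stand-in for GAN output

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(original)
print(np.bincount(labels))  # per-class sample counts before augmentation

combined = np.vstack([original, synthetic])
emb = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(combined)
plt.scatter(emb[:60, 0], emb[:60, 1], label="original")
plt.scatter(emb[60:, 0], emb[60:, 1], marker="x", label="synthetic")
plt.legend(); plt.show()
```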

To summarize, the techniques for data preprocessing and augmentation utilized in this research have demonstrated their efficacy in enhancing data quality. Through systematic assessments like histogram analysis, K-means clustering, and t-SNE visualization, it was evident that data augmentation increased the number of samples in less-represented classes, thus achieving a more equitable class distribution. Additionally, it maintained the statistical characteristics and overall integrity of the initial dataset. The augmented dataset also exhibited increased diversity and a wider scope, which bolstered the robustness, generalization, and feature learning capabilities of the model. These outcomes validate the effectiveness and necessity of the data augmentation strategy for constructing high-performance, dependable predictive models.

3.2 Comparative analysis

3.2.1 Traditional machine learning algorithms.
3.2.1.1 Prediction of Young's modulus. Based on training and test datasets, Fig. 3 illustrates the regression prediction results for Young's modulus using six traditional machine learning algorithms: decision tree, random forest, XGBoost, LSBoost, SVM, and GKR. In each main plot, the x-axis represents actual data and the y-axis predicted data; each point corresponds to a sample's true value and its predicted value. The plots also include fitting curves, fitting equations, and the coefficient of determination (R2). The shaded areas indicate the 95% confidence intervals for the predictions, showing the range of predicted values. For both the training and testing datasets, these intervals provide a visual representation of the uncertainty associated with the predictions and are crucial for understanding the reliability of the model: the narrower the confidence interval, the more precise the predictions are considered to be. Visually, the confidence intervals for the random forest and GKR algorithms are relatively narrow, offering a preliminary yet intuitive reflection of the superior performance of these two models in predicting Young's modulus. Each panel clearly displays the performance of the corresponding algorithm on both the training and testing datasets, with R2 values for the training set typically higher than those for the testing set, reflecting the model's generalization ability on unseen data. The optimal hyperparameters of the different algorithms were found within 100 iterations of Bayesian optimization and are listed in Table S2 in the SI. Histograms of the residual distributions of the various machine learning models used to predict Young's modulus are presented in Fig. S1 in the SI.
Fig. 3 Prediction performance of traditional machine learning models for Young's modulus. Scatter plots illustrate the predicted versus actual values for different algorithms: (a) decision tree, (b) random forest, (c) XGBoost, (d) LSBoost, (e) SVM, (f) GKR.
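Bayesian hyperparameter searches of the kind described above can be sketched with scikit-optimize's BayesSearchCV, one common implementation; the search ranges and base model below are illustrative rather than the exact configuration of this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from skopt import BayesSearchCV  # pip install scikit-optimize

X, y = np.random.rand(60, 18), np.random.rand(60)  # placeholder data

search = BayesSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": (50, 500), "max_depth": (2, 20), "min_samples_leaf": (1, 8)},
    n_iter=100, cv=5, scoring="r2", random_state=0,  # 100 iterations, as in the text
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```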

Typically, regression model assessments utilize metrics such as mean absolute error (MAE), mean squared error (MSE), mean absolute percentage error (MAPE), root mean square error (RMSE), and coefficient of determination (R2). The formulas for these are detailed in Table 2. A lower MAE, MSE, MAPE, and RMSE denote superior model efficacy. An R2 value approaching 1 indicates greater model explanatory strength. Specifically, the R2 value reflects the proportion of the variance in the dependent variable that is predictable from the independent variables. For instance, an R2 value of 0.95 means that 95% of the variability in the data can be explained by the model, indicating a very close fit to the actual data. Furthermore, while there may be minor differences in R2 values among algorithms, whether these differences are statistically significant needs to be determined through hypothesis testing. When assessing algorithm performance, we should not only focus on the R2 value but also consider multiple metrics including MAE, MSE, MAPE, and RMSE comprehensively. Table 3 presents specific performance metrics for the six algorithms used to predict the Young's modulus.

Table 2 Calculation formula for model evaluation indicators
Metrics Calculation formula
MAE $\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$
MSE $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$
MAPE $\text{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$
RMSE $\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$
R2 $R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$
where $y_i$ is the measured value, $\hat{y}_i$ the predicted value, $\bar{y}$ the mean of the measured values, and $n$ the number of samples.


Table 3 Evaluation metrics of the algorithms used to predict the Young's modulus
Model MAE MAPE MSE RMSE R 2
Decision tree 3.055 0.059911 13.964 3.7369 0.85135
Random forest 1.3592 0.025463 3.6202 1.9027 0.96146
XGBoost 1.8864 0.03522 5.3438 2.3117 0.94312
LSBoost 3.2503 0.059682 18.637 4.317 0.80161
SVM 1.8125 0.032409 6.3865 2.5272 0.93202
GKR 1.5743 0.02805 4.0328 2.0082 0.95707
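For reference, the metrics of Table 2 can be computed directly from predictions; the values below are hypothetical and serve only to show the calculation.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([53.2, 60.1, 47.8, 55.0])  # hypothetical measured moduli (GPa)
y_pred = np.array([52.1, 61.0, 48.5, 54.2])  # hypothetical model predictions

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mape = np.mean(np.abs((y_true - y_pred) / y_true))
r2 = r2_score(y_true, y_pred)
print(mae, mse, rmse, mape, r2)
```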


To compare the performance of various algorithms in predicting Young's modulus, multiple visualization methods were employed, including bar charts, scatter plots, radar charts, and Taylor diagrams. The bar chart (Fig. 4a) displays the MAE and RMSE for different algorithms. This highlights the performance differences across various error metrics. Random forest and Gaussian kernel regression demonstrated the best performance with the lowest MAE and RMSE values of 1.3592/1.9027 and 1.5743/2.0082 respectively, indicating smaller prediction errors. In contrast, LSBoost showed the poorest performance with the highest MAE and RMSE values of 3.2503/4.317.


Fig. 4 Visualized evaluation metrics for prediction of Young's modulus. (a) Bar chart of MAE, MAPE, and RMSE values for different algorithms. (b) Scatter plot depicting the relationship between R2 and MAE for the same set of algorithms. (c) Radar chart of prediction performance for different algorithms. (d) Taylor diagrams comparing models across correlation and standard deviation metrics.

The scatter plot (Fig. 4b) illustrates the relationship between R2 and MAE for each algorithm. Random forest and Gaussian kernel regression had R2 values close to 1, at 0.96146 and 0.95707, and lower MAE values, indicating superior explanatory power and predictive accuracy. Conversely, decision tree had the lowest R2 value of 0.85135 and a higher MAE of 3.055, suggesting relatively poorer performance.

The radar chart (Fig. 4c) assesses the predictive performance of each algorithm across multiple dimensions. Values closer to the center indicate better performance. Random forest exhibited the most compact and centrally located radar contour, signifying excellent performance across all metrics.

The Taylor diagram (Fig. 4d) compares models based on correlation and standard deviation. Gaussian kernel regression and random forest had the highest correlation, close to 1, and lower standard deviations, indicating strong linear relationships between predicted and actual values and stable predictions. The performance differences among algorithms in predicting Young's modulus are mainly due to variations in model structure and generalization ability. For example, random forest improves model stability and prediction accuracy through ensemble learning, while XGBoost is more sensitive to parameter tuning. Additionally, the feature distribution of the dataset also affects algorithm performance. The computational efficiency and scalability of random forest were also assessed to determine its suitability for large-scale applications: owing to its tree-based structure, the random forest algorithm natively supports parallel processing, which makes it scalable to large datasets. Through these visualizations, it is evident that random forest performs best in predicting the Young's modulus of halide glasses and is suitable for large-scale application and deployment.


3.2.1.2 Prediction of shear modulus. Based on training and test datasets, Fig. 5 illustrates the regression prediction results for shear modulus using the same six algorithms: decision tree, random forest, XGBoost, LSBoost, SVM, and GKR. Table 4 details the quantitative evaluation metrics for each algorithm. The optimal hyperparameters of the different algorithms were found within 120 iterations of Bayesian optimization and are listed in Table S3 in the SI. Although there are differences in the evaluation metrics across models, the ensemble learning models, particularly XGBoost and LSBoost, demonstrate exceptional predictive performance. This is attributed to the ability of ensemble learning to reduce the risk of overfitting and significantly enhance generalization by integrating the strengths of multiple base learners. Histograms of the residual distributions of the various machine learning models used to predict the shear modulus are presented in Fig. S2 in the SI.
Fig. 5 Prediction performance of traditional machine learning models for shear modulus. Scatter plots illustrate the predicted versus actual values for different algorithms: (a) decision tree, (b) random forest, (c) XGBoost, (d) LSBoost, (e) SVM, (f) GKR.
Table 4 Evaluation metrics of the algorithms used to predict the shear modulus
Model MAE MAPE MSE RMSE R 2
Decision tree 0.99353 0.045655 1.5349 1.2389 0.88656
Random forest 0.83684 0.038208 1.207 1.0986 0.91079
XGBoost 0.63581 0.028557 0.80397 0.89664 0.94058
LSBoost 0.63832 0.028644 0.7635 0.87378 0.94357
SVM 0.68078 0.031112 0.65908 0.81184 0.95129
GKR 0.71417 0.030819 1.1652 1.0795 0.91388


To systematically evaluate the performance differences of the various algorithms in predicting shear modulus, this study employed four types of visualization for multidimensional analysis. The bar chart in Fig. 6a shows that XGBoost and LSBoost stand out in the error metrics: XGBoost attains a slightly lower MAE than LSBoost (0.63581 versus 0.63832) but a slightly higher RMSE (0.89664 versus 0.87378). Although SVM has the best error control in terms of RMSE (0.81184), its MAE of 0.68078 is still higher than those of XGBoost and LSBoost. The scatter plot (Fig. 6b) further reveals the balance between accuracy and error: SVM sits at the top of the chart with an R2 value of 0.95129, the highest predictive accuracy, while XGBoost and LSBoost follow closely, forming a high-precision cluster. The decision tree lies in the lower left area, showing a significant performance gap.


Fig. 6 Visualized evaluation metrics for prediction of shear modulus. (a) Bar chart of MAE, MAPE, and RMSE values for different algorithms. (b) Scatter plot depicting the relationship between R2 and MAE for the same set of algorithms. (c) Radar chart of prediction performance for different algorithms. (d) Taylor diagrams comparing models across correlation and standard deviation metrics.

The radar chart (Fig. 6c) shows that random forest has the best balance in all five dimensions, with shorter line lengths for MAE and RMSE compared to decision tree and GKR. XGBoost and LSBoost excel in the 1 − R2 dimension. The Taylor diagram (Fig. 6d) further confirms this: LSBoost has a correlation coefficient of 0.97, closest to the ideal reference point, better than random forest (0.91) and decision tree (0.89), indicating the best convergence of predicted and actual values.

In the task of predicting shear modulus, LSBoost and XGBoost are confirmed as optimal algorithms due to their outstanding comprehensive performance. LSBoost achieves a dual breakthrough in precision and error control. Its core strength lies in the residual iterative optimization mechanism, which corrects prediction biases round by round through a forward stepwise additive model, thereby continuously enhancing the capture of complex nonlinear material relationships. XGBoost maintains a balance between efficiency and stability. Its gradient boosting framework integrates L1/L2 regularization, which suppresses overfitting and accelerates parameter optimization through parallel computing. In the scatter plot, LSBoost and XGBoost jointly occupy the core area of high precision and low error (R2 > 0.94, MAE < 0.64), significantly outperforming other comparative algorithms, such as SVM with a higher MAE and decision tree, which shows weakness in both metrics. Furthermore, LSBoost, with its iterative optimization mechanism, efficiently handles various datasets and completes training quickly, making it particularly useful in scenarios requiring frequent model iteration and updates. XGBoost leverages gradient boosting and regularization techniques to maintain high accuracy while enhancing training efficiency, and supports parallel computing, further improving its scalability when dealing with large datasets. These advantages make LSBoost and XGBoost ideal choices for large-scale data analysis and predictive tasks.


3.2.1.3 Prediction of Poisson's ratio. Based on comprehensive assessments from training and test datasets, Fig. 7 clearly illustrates the regression prediction results of the six traditional machine learning algorithms for Poisson's ratio. Data points in all subplots are closely distributed around the reference diagonal, with R2 values for both training and test datasets generally above 0.85. For instance, the decision tree has an R2 of 0.96783 on the test set, and random forest reaches 0.92653. The concentrated distribution of points visually indicates that all six algorithms demonstrate excellent fitting capabilities in predicting Poisson's ratio, with significant correlation between predicted and actual values. The high precision of all models confirms the universal effectiveness of machine learning algorithms in capturing the complex nonlinear relationships underlying the Poisson's ratio of these materials (Fig. 7). Quantitative evaluation metrics for each algorithm are detailed in Table 5. The optimal hyperparameters of the different algorithms were found within 100 iterations of Bayesian optimization and are listed in Table S4 in the SI. Histograms of the residual distributions of the various machine learning models used to predict the Poisson's ratio are presented in Fig. S3 in the SI.
Fig. 7 Prediction performance of traditional machine learning models for Poisson's ratio. Scatter plots illustrate the predicted versus actual values for different algorithms: (a) decision tree, (b) random forest, (c) XGBoost, (d) LSBoost, (e) SVM, (f) GKR.
Table 5 Evaluation metrics of the algorithms used to predict the Poisson's ratio
Model MAE MAPE MSE RMSE R 2
Decision tree 0.011614 0.039959 0.00021404 0.01463 0.96783
Random forest 0.017091 0.080553 0.00048878 0.022108 0.92653
XGBoost 0.012807 0.042764 0.000241 0.015524 0.96378
LSBoost 0.017224 0.059652 0.00038702 0.019673 0.94183
SVM 0.016307 0.061119 0.00042916 0.020716 0.93549
GKR 0.021079 0.084321 0.0010421 0.032282 0.84336


Based on the quantitative metrics from Table 5 and the visualizations in Fig. 8, the six algorithms exhibit distinct performance tiers in predicting Poisson's ratio. The bar chart (Fig. 8a) shows that the decision tree leads in all metrics, establishing its absolute advantage. The scatter plot (Fig. 8b) indicates that XGBoost ranks second with R2 = 0.96378 and MAE = 0.012807, and its Taylor diagram position (orange) is close to that of the decision tree, reflecting similar predictive stability. Notably, in Fig. 8b, the decision tree and XGBoost jointly occupy the top right golden area (R2 > 0.96, MAE < 0.013), creating a clear performance gap. Among the ensemble models, LSBoost (R2 = 0.94183) and random forest (R2 = 0.92653) outperform GKR but still lag significantly behind the decision tree; the radar chart (Fig. 8c) shows both expanding noticeably on the 1 − R2 axis, indicating weaker multi-metric balance than the decision tree.


Fig. 8 Visualized evaluation metrics for prediction of Poisson's ratio. (a) Bar chart of MAE, MAPE, and RMSE values for different algorithms. (b) Scatter plot depicting the relationship between R2 and MAE for the same set of algorithms. (c) Radar chart of prediction performance for different algorithms. (d) Taylor diagrams comparing models across correlation and standard deviation metrics.

The Taylor diagram (Fig. 8d) visually confirms this, with the decision tree's red marker having the highest correlation coefficient (nearly 0.97) and the lowest standard deviation (0.014), showing that its predicted distribution almost overlaps with the actual values. GKR is the only model with a performance collapse, its position in the Taylor diagram (yellow) lying far from the center with a high standard deviation of 0.032, echoing its highest RMSE (0.032282) in Table 5.

The root of this performance contrast lies in the deep coupling of data characteristics and algorithm mechanisms: Poisson's ratio often exhibits threshold sensitivity, which the decision tree can accurately capture through recursive splitting, whereas the diversity design of ensemble models, while enhancing generalization, may diminish sensitivity to key thresholds. Ultimately, the four visualizations collectively indicate that, in Poisson's ratio prediction tasks, the traditionally "simple" decision tree becomes, by virtue of its structural characteristics, an unexpectedly optimal solution that surpasses modern ensemble algorithms. As for the suitability of the decision tree for large-scale applications, assessments indicate that while decision trees perform well on small to medium-sized datasets, they may encounter challenges in computational efficiency and memory management when scaling to large datasets. In large-scale applications it may therefore be necessary to use more efficient algorithms or to optimize the decision tree for big-data environments.


3.2.1.4 Prediction of electrical resistance under 20 °C. Based on comprehensive assessments from training and test datasets, Fig. 9 clearly illustrates the regression prediction results of the six traditional machine learning algorithms for electrical resistance at 20 °C. The preliminary analysis indicates that the decision tree, XGBoost, and LSBoost algorithms perform well in this task, with R2 coefficients exceeding 0.9, demonstrating high predictive accuracy and model fit. This suggests that these algorithms can effectively capture complex patterns in the data and thus provide reliable predictions. In contrast, random forest, SVM, and GKR perform more modestly, with relatively lower R2 coefficients, indicating that these algorithms might not fully leverage the information in the dataset or capture the nonlinear relationships as accurately. Such performance differences may be related to the intrinsic mechanisms of the algorithms as well as to the characteristics of the data. Quantitative evaluation metrics detailing each algorithm's precision are documented in Table 6. The optimal hyperparameters of the different algorithms were found within 140 iterations of Bayesian optimization and are listed in Table S5 in the SI. Histograms of the residual distributions of the various machine learning models used to predict the electrical resistance at 20 °C are presented in Fig. S4 in the SI.
Fig. 9 Prediction performance of traditional machine learning models for electrical resistance at 20 °C. Scatter plots illustrate the predicted versus actual values for different algorithms: (a) decision tree, (b) random forest, (c) XGBoost, (d) LSBoost, (e) SVM, (f) GKR.
Table 6 Evaluation metrics of the algorithms used to predict the electrical resistance under 20 °C
Model MAE MAPE MSE RMSE R 2
Decision tree 0.75773 0.095029 1.2115 1.1007 0.96783
Random forest 2.4155 0.34576 14.883 3.8579 0.71024
XGBoost 1.2619 0.15891 4.4908 2.1191 0.91257
LSBoost 0.63829 0.074194 1.1835 1.0879 0.97696
SVM 4.8909 0.73553 36.035 6.0029 0.29844
GKR 4.4492 0.61383 30.904 5.5591 0.39835


Based on the metrics from Table 6 and the visualizations in Fig. 10, the six algorithms exhibit varying performance in predicting electrical resistance at 20 °C. The bar chart in Fig. 10a demonstrates that LSBoost leads in MAE (0.63829) and RMSE (1.0879), highlighting its significant advantage in accuracy and error magnitude; the decision tree follows closely, indicating strong performance as well. The scatter plot (Fig. 10b) reveals the relationship between R2 and MAE, where the decision tree and XGBoost excel with higher R2 values, indicating better fit and predictive accuracy in capturing the complex nonlinear behavior of the electrical resistance. In contrast, SVM and GKR have lower R2 values, at 0.29844 and 0.39835 respectively, suggesting weaker model fit and predictive capability. The radar chart (Fig. 10c) further assesses the performance of each algorithm across multiple metrics: LSBoost performs well in most dimensions, demonstrating superior performance in predicting the electrical resistance, while XGBoost also does well but slightly lags behind the decision tree in various dimensions. The Taylor diagram (Fig. 10d) compares models in terms of correlation and standard deviation. LSBoost has the highest correlation coefficient, close to 1, and the lowest standard deviation, indicating high consistency and stability of its predictions with actual values; XGBoost also shows high correlation and low standard deviation. In comparison, GKR has the highest standard deviation, indicating poorer stability in its predictive outcomes.


Fig. 10 Visualized evaluation metrics for prediction of electrical resistance under 20 °C. (a) Bar chart of MAE, MAPE, and RMSE values for different algorithms. (b) Scatter plot depicting the relationship between R2 and MAE for the same set of algorithms. (c) Radar chart of prediction performance for different algorithms. (d) Taylor diagrams comparing models across correlation and standard deviation metrics.

In summary, the LSBoost algorithm outperforms in the task of predicting electrical resistance at 20 °C, demonstrating clear advantages in both error metrics and fit. The decision tree and XGBoost also show strong performance, albeit slightly inferior to LSBoost. In contrast, SVM and GKR fall short, particularly in terms of model fit and predictive accuracy. These results not only reveal significant performance differences among algorithms when dealing with specific types of data but also underscore the importance of selecting the right algorithm to enhance predictive accuracy. The performance differences among algorithms in predicting electrical resistance at 20 °C can be attributed to variations in model structure, feature processing capabilities, and generalization ability. For instance, LSBoost effectively handles complex data patterns through its stepwise optimization process, leading to superior performance in predicting electrical resistance. In contrast, SVM and GKR may lack the flexibility to deal with non-linear relationships, resulting in less effective performance on this specific dataset. Additionally, the feature distribution and complexity of the dataset can influence algorithm performance, further exacerbating these differences.


3.2.1.5 Prediction of electrical resistance under 100 °C. Based on comprehensive assessments from training and test datasets, Fig. 11 illustrates the performance of the six traditional machine learning algorithms in predicting electrical resistance at 100 °C. Subfigures correspond to the decision tree, random forest, XGBoost, LSBoost, SVM, and GKR algorithms; each subplot is a scatter plot displaying the relationship between predicted and actual values. It is evident that random forest, XGBoost, and LSBoost exhibit higher R2 values on both training and test datasets, indicating better fit and predictive accuracy for electrical resistance at 100 °C. In contrast, SVM and GKR have lower R2 values on the test set, at 0.36106 and 0.683 respectively, showing relatively weaker predictive performance. Detailed quantitative evaluation metrics for each algorithm are listed in Table 7. The optimal hyperparameters of the different algorithms were found within 160 iterations of Bayesian optimization and are listed in Table S6 in the SI. Histograms of the residual distributions of the various machine learning models used to predict the electrical resistance at 100 °C are presented in Fig. S5 in the SI.
Fig. 11 Prediction performance of traditional machine learning models for electrical resistance at 100 °C. Scatter plots illustrate the predicted versus actual values for different algorithms: (a) decision tree, (b) random forest, (c) XGBoost, (d) LSBoost, (e) SVM, (f) GKR.
Table 7 Evaluation metrics of the algorithms used to predict the electrical resistance under 100 °C
Model MAE MAPE MSE RMSE R2
Decision tree 0.71585 0.10536 1.2247 1.1067 0.83216
Random forest 0.49074 0.059917 0.63608 0.79754 0.91283
XGBoost 0.40871 0.055219 0.47559 0.68963 0.93482
LSBoost 0.41942 0.050811 0.55536 0.74523 0.92389
SVM 1.7456 0.24834 4.6624 2.1593 0.36106
GKR 1.1257 0.1557 2.3132 1.5209 0.683


Fig. 12 visualizes the performance of the six machine learning algorithms in predicting electrical resistance at 100 °C. The bar chart (Fig. 12a) clearly reveals differences in error control among the algorithms: XGBoost performs best in MAE (0.40871) and RMSE (0.68963), while SVM is the worst on both metrics (MAE = 1.7456, RMSE = 2.1593), making it the only model whose performance collapses.


Fig. 12 Visualized evaluation metrics for prediction of electrical resistance under 100 °C. (a) Bar chart of MAE, MAPE, and RMSE values for different algorithms. (b) Scatter plot depicting the relationship between R2 and MAE for the same set of algorithms. (c) Radar chart of prediction performance for different algorithms. (d) Taylor diagrams comparing models across correlation and standard deviation metrics.

The scatter plot (Fig. 12b) relates R2 to MAE. XGBoost (orange dot) occupies the high-R2, low-MAE region of the chart (R2 = 0.93482, MAE = 0.40871), combining high precision with low error. LSBoost (purple dot), with R2 = 0.92389 and MAE = 0.41942, sits close by, forming a high-performance cluster. Random forest (blue dot), though acceptable in R2 (0.91283), has a noticeably higher MAE (0.49074), falling into a medium-precision, medium-error range, while SVM (red dot) is isolated in the lower left corner (R2 = 0.36106, MAE = 1.7456), an outlier on both metrics.

The radar chart (Fig. 12c) highlights intrinsic differences in overall performance. The contour of XGBoost stays closest to the centre, with the shortest line lengths on the MAE (0.40871) and RMSE (0.68963) axes and an outstanding value on the R2 axis (0.93482). LSBoost even edges out XGBoost on the MAPE axis (0.050811 versus 0.055219) and maintains a balanced overall shape, clearly better than the star-shaped distortion of random forest. In contrast, SVM expands sharply in all dimensions, with its MSE (4.6624) nearly ten times that of XGBoost (0.47559), exposing its prediction volatility.

The Taylor diagram (Fig. 12d) compares models in terms of correlation and standard deviation. LSBoost has the highest correlation coefficient, close to 1, and the lowest standard deviation, indicating high consistency and stability of its predictions. XGBoost also shows high correlation and low standard deviation.

In summary, XGBoost and LSBoost algorithms demonstrate superior performance in predicting electrical resistance at 100 °C, with clear advantages in error metrics and fit. Their success stems from their optimized model structures and feature processing, which enable effective capture of complex data patterns and a balanced trade-off between model generalization and predictive accuracy. Conversely, SVM and GKR's limited flexibility in handling non-linear relationships leads to weaker performance. The feature distribution and complexity of the dataset also play a role in influencing algorithm performance, further widening the performance gap. These findings underscore the critical importance of algorithm selection in enhancing predictive accuracy for specific data.


3.2.1.6 Prediction of electrical resistance under 150 °C. Based on comprehensive assessments of the training and test datasets, Fig. 13 illustrates the performance of the six traditional machine learning algorithms in predicting the electrical resistance of halide glass at 150 °C. XGBoost and LSBoost clearly exhibit higher R2 values (greater than 0.85) on both the training and test datasets, indicating better fit and predictive accuracy for electrical resistance at 150 °C. The decision tree and random forest have intermediate R2 values on the test set, while SVM and GKR have the lowest, at 0.44189 and 0.59128 respectively, suggesting relatively weaker predictive performance. Table 8 lists detailed quantitative evaluation metrics for each algorithm. Histograms of the residual distributions for the machine learning models predicting electrical resistance at 150 °C are presented in Fig. S6 in the SI. The optimal hyperparameters of the different algorithms, found through Bayesian optimization over 140 iterations, are listed in Table S7 in the SI, where we detail the search space for each algorithm and the criteria for selecting the optimal parameters. The search space covers the range of values for each hyperparameter, such as the minimum number of observations per leaf in the decision tree and the number of trees in the random forest. The optimal parameters are selected on the basis of model performance on the validation set, with the goal of minimizing error metrics such as RMSE and maximizing the coefficient of determination.
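As an illustration of such a Bayesian hyperparameter search, the sketch below tunes an XGBoost regressor against validation-set R2. The choice of Optuna (whose TPE sampler is one form of Bayesian optimization), the search ranges, and the synthetic data are assumptions for illustration; the actual search spaces and optima are those reported in Table S7.

```python
import optuna
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=20, noise=5.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

def objective(trial):
    # Illustrative search space; the real ranges are given in the SI
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = xgb.XGBRegressor(**params, random_state=0)
    model.fit(X_tr, y_tr)
    return r2_score(y_val, model.predict(X_val))  # maximize validation R2

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=140)   # 140 iterations, as for Table S7
print(study.best_params, study.best_value)
```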
Fig. 13 Prediction performance of traditional machine learning models for electrical resistance at 150 °C. Scatter plots illustrate the predicted versus actual values for different algorithms: (a) decision tree, (b) random forest, (c) XGBoost, (d) LSBoost, (e) SVM, (f) GKR.
Table 8 Evaluation metrics of the algorithms used to predict the electrical resistance under 150 °C
Model MAE MAPE MSE RMSE R2
Decision tree 0.96867 0.19652 1.2076 1.0989 0.80959
Random forest 0.69399 0.093582 1.7847 1.3359 0.71861
XGBoost 0.48629 0.073801 0.57371 0.75744 0.90954
LSBoost 0.60325 0.09348 0.76538 0.87486 0.87932
SVM 1.5012 0.23148 3.5398 1.8814 0.44189
GKR 1.1983 0.18921 2.5923 1.6101 0.59128


Several visualization methods in Fig. 14 depict the performance of the six machine learning algorithms in predicting electrical resistance at 150 °C. The bar chart (Fig. 14a) shows the MAE, MAPE, and RMSE values for the different algorithms. XGBoost performs best in MAE (0.48629) and RMSE (0.75744), while SVM performs worst on these metrics, with an MAE of 1.5012 and an RMSE of 1.8814. This indicates that XGBoost offers better fit and predictive accuracy for this task, whereas the predictive performance of SVM is comparatively weak.


Fig. 14 Visualized evaluation metrics for prediction of electrical resistance under 150 °C. (a) Bar chart of MAE, MAPE, and RMSE values for different algorithms. (b) Scatter plot depicting the relationship between R2 and MAE for the same set of algorithms. (c) Radar chart of prediction performance for different algorithms. (d) Taylor diagrams comparing models across correlation and standard deviation metrics.

The scatter plot (Fig. 14b) relates R2 to MAE. XGBoost and LSBoost not only excel in MAE but also have the highest R2 values, at 0.90954 and 0.87932 respectively, showing better fit and predictive accuracy. In contrast, the decision tree and random forest, despite moderate R2 values, have relatively high MAE values, indicating potentially larger prediction biases in certain situations.

The radar chart (Fig. 14c) further evaluates the performance of each algorithm across multiple metrics. XGBoost performs well in most dimensions, demonstrating superior performance in predicting electrical resistance. LSBoost also does well but slightly lags behind XGBoost in the MAPE dimension. SVM performs poorly across all dimensions, showing limitations when dealing with this dataset.

The Taylor diagram (Fig. 14d) compares models in terms of correlation and standard deviation. XGBoost has the highest correlation coefficient, close to 1, and the lowest standard deviation, indicating high consistency and stability of its predictions. LSBoost also shows high correlation and low standard deviation.

In summary, XGBoost and LSBoost algorithms demonstrate the best comprehensive performance in predicting electrical resistance at 150 °C, with clear advantages in error metrics and fit. XGBoost, through gradient boosting, effectively captures complex non-linear relationships, achieving a balance of high precision and low error. LSBoost also performs well, though slightly behind XGBoost in some metrics. In contrast, SVM and GKR show less satisfactory performance, particularly in model fit and predictive accuracy. SVM, based on a linear model, may underperform due to its limitations in handling complex non-linear relationships, despite its ability to process non-linear data through kernel tricks. GKR, while capable of handling some non-linear relationships, still falls short in model fit and predictive accuracy. Additionally, the feature distribution and complexity of the dataset influence algorithm performance, further exacerbating these differences. These findings underscore the critical importance of selecting the appropriate algorithm to enhance predictive accuracy for specific data.

3.2.2 Deep learning and neural network algorithms. As previously mentioned, deep learning and neural network algorithms can directly complete multi-input multi-output tasks by constructing neural networks, thereby significantly improving computational efficiency. The optimal hyperparameters of different algorithms are listed in Table S8 in the SI.
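As a minimal sketch of such a multi-input multi-output network, a BP-style multilayer perceptron can be trained on all six targets at once; the layer sizes and synthetic data below are illustrative assumptions, with the hyperparameters actually used given in Table S8.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))   # stand-in: composition + elemental descriptors
Y = rng.normal(size=(300, 6))    # stand-in: E, G, nu, R at 20/100/150 C

# MLPRegressor supports multi-output targets natively and is trained
# by backpropagation, i.e. a BP network in the terminology used here
bp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 32), activation="relu",
                 solver="adam", max_iter=2000, random_state=0),
)
bp.fit(X, Y)
print(bp.predict(X[:2]).shape)   # -> (2, 6): all six properties per sample
```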

Fig. 15 illustrates the prediction performance of the BP algorithm for different tasks using scatter plots, including Young's modulus, shear modulus, Poisson's ratio, and electrical resistance at 20 °C, 100 °C, and 150 °C. Each subplot displays the relationship between predicted and actual values, including real values from training and testing datasets, as well as 95% confidence intervals and regression fit lines. The frequency distribution of prediction errors for the BP algorithm is presented in Fig. S7 in the SI. Table 9 lists detailed quantitative evaluation metrics for the BP algorithm across various tasks.


Fig. 15 Prediction performance of BP algorithm. Scatter plots illustrate the predicted versus actual values for different tasks: (a) Young's modulus, (b) shear modulus, (c) Poisson's ratio, (d) electrical resistance under 20 °C, (e) electrical resistance under 100 °C, (f) electrical resistance under 150 °C.
Table 9 Evaluation metrics of the BP algorithm when handling different tasks
Task MAE MAPE MSE RMSE R2
Young's modulus 1.6378 0.0296559 5.8617 2.4211 0.94832
Shear modulus 0.72071 0.03486 1.0403 1.02 0.89891
Poisson's ratio 0.021638 0.070306 0.0010213 0.031958 0.83758
Electrical resistance under 20 °C 0.74349 0.070224 1.2185 1.1039 0.90835
Electrical resistance under 100 °C 0.46165 0.059789 0.49996 0.70708 0.91712
Electrical resistance under 150 °C 0.39296 0.060607 0.30456 0.55187 0.92586


According to the data in Table 9, we can observe the performance of the BP algorithm across different tasks. In predicting Young's modulus, the BP algorithm achieves an MAE of 1.6378, MAPE of 0.0296559, MSE of 5.8617, RMSE of 2.4211, and R2 of 0.94832, demonstrating high predictive accuracy. In contrast, for Poisson's ratio the BP algorithm performs more modestly, with an R2 of only 0.83758. For electrical resistance, the BP algorithm has an MAE of 0.74349 at 20 °C with an R2 of 0.90835, and an MAE of 0.39296 at 150 °C with an R2 of 0.92586; the error thus decreases as temperature rises, while the R2 value remains high, indicating good model fit.

Overall, the analysis indicates that the BP algorithm exhibits good adaptability and accuracy in handling prediction tasks for different physical quantities. Despite performance fluctuations under different tasks and temperatures, the BP algorithm generally provides reliable predictive outcomes.

The scatter plot distributions in Fig. 16 indicate that the LSTM algorithm exhibits task dependency in material property prediction, and the quantitative metrics in Table 10 confirm this observation. The frequency distribution of prediction errors for the LSTM algorithm is presented in Fig. S8 in the SI. In the Young's modulus prediction task (Fig. 16a), the data points are closely distributed around the fit line, with R2 values of 0.93329 for the training set and 0.89608 for the test set; together with the MAPE of 0.044864 and RMSE of 4.12 from Table 10, this indicates good stability and generalization capability for this mechanical property. The data distribution is concentrated, and the narrow confidence intervals confirm that the algorithm captures the variation patterns of Young's modulus.


Fig. 16 Prediction performance of LSTM algorithm. Scatter plots illustrate the predicted versus actual values for different tasks: (a) Young's modulus, (b) shear modulus, (c) Poisson's ratio, (d) electrical resistance under 20 °C, (e) electrical resistance under 100 °C, (f) electrical resistance under 150 °C.
Table 10 Evaluation metrics of the LSTM algorithm when handling different tasks
Task MAE MAPE MSE RMSE R2
Young's modulus 2.9172 0.044864 16.975 4.12 0.89608
Shear modulus 1.6105 0.06411 4.5494 2.1329 0.65368
Poisson's ratio 0.026185 0.087691 0.00093472 0.030573 0.021358
Electrical resistance under 20 °C 1.8092 0.15538 4.0042 2.001 0.1608
Electrical resistance under 100 °C 1.2373 0.14498 2.08 1.4422 0.46592
Electrical resistance under 150 °C 1.204 0.16333 1.7366 1.3178 0.64795


In contrast, shear modulus prediction (Fig. 16b) maintains a certain linear trend but shows increased scatter, with the test set R2 dropping to 0.65368 and higher MAE (1.6105) and RMSE (2.1329) reflecting a decrease in predictive accuracy. Most notably, the failure in Poisson's ratio prediction (Fig. 16c) is evident: the data points are disordered, the fit line is nearly horizontal, and the confidence interval widens markedly, matching the extremely low R2 value of 0.021358 in Table 10. Although the MAE (0.026185) appears small, the MAPE of 8.77% reveals substantial relative prediction bias given the small numerical range (0-0.5), indicating that the model failed to learn effective patterns for Poisson's ratio.

In electrical resistance prediction tasks, temperature becomes a key variable. The prediction at 20 °C (Fig. 16d) shows highly scattered data points, with the fit line deviating from the ideal position, corresponding to an R2 of 0.1608 and a MAPE of 15.54% in Table 10 and indicating difficulty in capturing the mechanisms of resistance change at low temperatures. As the temperature rises to 100 °C (Fig. 16e), the data points begin to gather around the fit line, with R2 increasing to 0.46592 and MAE dropping to 1.2373, showing that higher temperatures help improve prediction consistency. At 150 °C (Fig. 16f), performance improves further: R2 peaks at 0.64795, the data distribution is most concentrated, and the fit line lies closest to y = x, although the high MAPE indicates that relative errors remain large because the actual resistance values decrease.

Overall, the LSTM algorithm performs well in predicting Young's modulus and high-temperature (150 °C) electrical resistance but shows significant deficiencies in predicting Poisson's ratio and low-temperature electrical resistance. These performance differences stem both from the complexity of the intrinsic patterns in the different physical quantities (such as the nonlinear character of Poisson's ratio) and from the influence of external conditions (such as temperature) on data regularity. Future optimization should focus on enhancing the algorithm's ability to represent complex relationships, for example by introducing material microstructural parameters through feature engineering or by employing hybrid model architectures, to improve modeling accuracy for challenges such as low-temperature electrical resistance.
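For context, the sketch below shows one common way an LSTM is applied to such non-sequential composition data, treating each descriptor vector as a length-one sequence. This configuration is an assumption for illustration rather than the setting of Table S8, and the absence of a genuine temporal dimension is consistent with the weaknesses discussed above.

```python
import torch
import torch.nn as nn

class LSTMRegressor(nn.Module):
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, seq_len=1, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])   # last time step -> scalar property

model = LSTMRegressor(n_features=20)
x = torch.randn(8, 1, 20)                 # 8 samples, one "time step" each
print(model(x).shape)                     # -> torch.Size([8, 1])
```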

Fig. 17 illustrates the prediction performance of the CNN algorithm across different tasks using scatter plots, including Young's modulus, shear modulus, Poisson's ratio, and electrical resistance at 20 °C, 100 °C, and 150 °C. In the Young's modulus prediction task (Fig. 17a), data points align closely with the fit line within a narrow gray confidence interval, consistent with the high-precision metrics in Table 11: R2 reaches 0.97581, MAE is 1.0521, and MAPE is as low as 1.78%, indicating that the CNN algorithm predicts this mechanical property with high accuracy. Such a concentrated scatter distribution shows that the model not only captures the core patterns of Young's modulus but also reproduces subtle fluctuations, outperforming the LSTM model discussed above. The frequency distribution of prediction errors for the CNN algorithm is presented in Fig. S9 in the SI.


Fig. 17 Prediction performance of CNN algorithm. Scatter plots illustrate the predicted versus actual values for different tasks: (a) Young's modulus, (b) shear modulus, (c) Poisson's ratio, (d) electrical resistance under 20 °C, (e) electrical resistance under 100 °C, (f) electrical resistance under 150 °C.
Table 11 Evaluation metrics of the CNN algorithm when handling different tasks
Task MAE MAPE MSE RMSE R2
Young's modulus 1.0521 0.017845 2.0956 1.4476 0.97581
Shear modulus 0.87758 0.037648 1.0279 1.0139 0.91755
Poisson's ratio 0.029722 0.10016 0.0014237 0.037731 0.13426
Electrical resistance under 20 °C 0.70597 0.061689 0.88768 0.94217 0.57668
Electrical resistance under 100 °C 0.6567 0.069058 0.69198 0.83186 0.73371
Electrical resistance under 150 °C 0.85976 0.11711 1.2705 1.1272 0.65372


Shear modulus prediction (Fig. 17b), though slightly inferior, still shows a clear linear trend; the confidence intervals widen slightly but remain under control. The R2 of 0.91755 and MAE of 0.87758 in Table 11 confirm this robustness, and the low MAPE of 3.76% indicates that the model remains reliable even for samples with larger values. Notably, for Poisson's ratio prediction (Fig. 17c), performance drops sharply: the data points are disordered, the fit line is nearly horizontal, and the gray confidence interval covers half the plot area, visually exposing model failure. The extremely low R2 of 0.13426 and the 10.02% MAPE in Table 11 together confirm this collapse; although the MAE (0.02972) seems small, given the typical range of Poisson's ratio (usually <0.5), the actual prediction bias is far beyond acceptable limits.

Electrical resistance prediction exhibits complex temperature effects. At 20 °C (Fig. 17d), the data points are widely scattered, many far from the fit line, and the gray confidence interval fans out, corresponding to the mediocre R2 of 0.57668 and 6.17% MAPE in Table 11 and revealing the difficulty of modeling the material's electrical behavior at low temperatures. As the temperature rises to 100 °C (Fig. 17e), the scatter cloud contracts noticeably towards the fit line, the confidence interval narrows, R2 peaks at 0.73371, and MAE drops to 0.6567, indicating more easily captured regularities in the mid-temperature range. Surprisingly, at 150 °C (Fig. 17f) performance does not continue to improve: the data points disperse again, R2 falls back to 0.65372, MAE increases to 0.85976, and MAPE rises to 11.71%, suggesting that new complexity factors (such as phase changes or altered electron-migration mechanisms) may emerge at high temperatures and degrade the model's predictions.

Overall, CNN demonstrates dominant performance in predicting Young's modulus, robust performance in shear modulus and electrical resistance at 100 °C, but significant shortcomings in predicting Poisson's ratio and extreme temperature (especially 150 °C) electrical resistance. These fluctuations highlight the complexity of material property prediction and expose the limitations of a single model architecture—future optimizations may need to introduce attention mechanisms for Poisson's ratio to capture nonlinear associations and integrate physics-inspired feature engineering for high-temperature electrical resistance to achieve breakthroughs in comprehensive prediction scenarios.
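For comparison, a 1D convolutional regressor of the kind discussed here can be sketched as follows, treating the descriptor vector as a single-channel one-dimensional signal so that convolutions capture local correlations between adjacent descriptors; the architecture is an illustrative assumption, not the network of Table S8.

```python
import torch
import torch.nn as nn

class CNNRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),   # pool over the feature axis
            nn.Linear(32, 1),                        # scalar property output
        )

    def forward(self, x):            # x: (batch, 1 channel, n_features)
        return self.net(x)

model = CNNRegressor()
print(model(torch.randn(8, 1, 20)).shape)   # -> torch.Size([8, 1])
```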

Fig. 18 illustrates the prediction performance of the GRNN algorithm across various tasks using scatter plots, including Young's modulus, shear modulus, Poisson's ratio, and electrical resistance at 20 °C, 100 °C, and 150 °C. Each subplot shows the relationship between predicted and actual values, including real values from training and testing datasets, as well as 95% confidence intervals and regression fit lines. The frequency distribution of prediction errors for the GRNN algorithm is presented in Fig. S10 in the SI.


Fig. 18 Prediction performance of GRNN algorithm. Scatter plots illustrate the predicted versus actual values for different tasks: (a) Young's modulus, (b) shear modulus, (c) Poisson's ratio, (d) electrical resistance under 20 °C, (e) electrical resistance under 100 °C, (f) electrical resistance under 150 °C.

In the Young's modulus prediction (Fig. 18a), the data points are generally distributed along the diagonal, but the test set (lighter dots) deviates significantly from the fit line, forming a vertically diffuse band. This matches the R2 value of 0.86916 in Table 12, indicating that the model captured the basic trend but with systematic prediction bias (MAE of 2.7112 and RMSE of 3.4614). Notably, the confidence interval (light blue band) widens significantly in the high-value region, suggesting decreased prediction reliability for samples with high Young's modulus.

Table 12 Evaluation metrics of the GRNN algorithm when handling different tasks
Task MAE MAPE MSE RMSE R2
Young's modulus 2.7112 0.054223 11.981 3.4614 0.86916
Shear modulus 1.0662 0.055404 1.8608 1.3641 0.79759
Poisson's ratio 0.0059436 0.025771 4.9438 × 10−5 0.0070312 0.99256
Electrical resistance under 20 °C 3.5352 0.31936 15.626 3.953 0.13629
Electrical resistance under 100 °C 2.3131 0.28665 6.6639 2.5814 0.082514
Electrical resistance under 150 °C 1.7825 0.26356 3.947 1.9867 0.045296


The prediction of shear modulus (Fig. 18b) shows a more concentrated scatter distribution, with the training and test fit lines (dark/light lines) nearly overlapping, corresponding to the moderate R2 value of 0.79759 in Table 12. However, some high-value samples (upper right corner) still deviate from the diagonal, yielding a higher MAE (1.0662) than in the Poisson's ratio task and reflecting slight inaccuracies for extreme shear modulus values.

The prediction of Poisson's ratio (Fig. 18c) demonstrates impressive accuracy, with all data points aligned almost perfectly along the fit line in a straight, dense array. This near-perfect fit is confirmed by the metrics in Table 12: R2 = 0.99256, an MAE of only 0.0059, and errors below 2.6% of the true values (MAPE = 0.025771), indicating GRNN's unique advantage for physical quantities with such a small numerical range.

Electrical resistance prediction shows temperature-related anomalies. At 20 °C (Fig. 18d), 100 °C (Fig. 18e), and 150 °C (Fig. 18f), the R2 of the algorithm is below 0.2, exposing complete model failure: the radial basis function cannot handle the highly nonlinear response of electrical resistance, particularly near phase-transition temperatures.

Overall, GRNN demonstrates dominant performance in predicting Poisson's ratio but achieves only moderate levels for Young's modulus and shear modulus, and fails systematically in the electrical resistance tasks, highlighting its serious lack of adaptability to temperature-sensitive physical quantities. These extreme performance fluctuations warn that material property prediction requires model architectures tailored to different physical mechanisms; no single algorithm fits all scenarios.
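GRNN's behaviour is easier to rationalize given its transparent mathematical form: the network proposed by Specht in 1991 is essentially Nadaraya-Watson kernel regression with a Gaussian kernel and a single smoothing parameter, which explains both its strength on narrow-range targets such as Poisson's ratio and its failure on highly nonlinear resistance data. A minimal NumPy sketch with synthetic stand-in data:

```python
import numpy as np

def grnn_predict(X_train, y_train, X_query, sigma=0.5):
    # Squared Euclidean distances between queries and training patterns
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / (2.0 * sigma ** 2))      # Gaussian kernel weights
    return (w @ y_train) / w.sum(axis=1)      # kernel-weighted mean of targets

rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(100, 20)), rng.normal(size=100)
print(grnn_predict(X_tr, y_tr, X_tr[:3], sigma=0.5))  # predictions for 3 queries
```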

Overall, the BP algorithm excels in multi-task prediction, demonstrating good adaptability and optimization capabilities. LSTM, CNN, and GRNN algorithms have shortcomings in certain tasks. The BP algorithm optimizes neural network weights through backpropagation, which enables it to better adapt to multi-task prediction needs. The LSTM algorithm relies on temporal dependencies in time series data and may underperform in non-time series tasks due to the lack of a temporal dimension. The CNN algorithm extracts local features through convolutional layers and is suitable for data with spatial local correlation. Its performance in predicting material properties is influenced by the spatial structure of the data. The GRNN algorithm has high prediction accuracy for physical quantities with small numerical ranges but may fail when dealing with highly nonlinear relationships. The principles and structural characteristics of these algorithms determine their applicability and performance differences in different tasks. Future research should focus on developing task-specific hybrid architectures to enhance prediction accuracy in multi-task scenarios.

3.3 Interpreting the induced models

As stated above, we utilized SHAP analysis to examine the optimal models we constructed and to investigate the impact of each input feature on the predicted properties. It is worth noting that, although not employed in this study, integrating physical constraints into the models is a potential route to making the decision-making process traceable and further enhancing interpretability: including constraints derived from physical laws during training could, in principle, ensure that predictions are consistent both with the data and with established physical principles. This possibility presents a valuable direction for future research. In Young's modulus prediction (Fig. 19a), the average second-level ionization energy (ionization_energy2_avg) occupies the high SHAP value region (>0.5) with a steep, cliff-like point cluster, indicating that an increase in second-level ionization energy significantly increases Young's modulus, a quantum-level control of material stiffness by inner-electron binding. In contrast, the sandwich-like distribution of the atomic volume deviation (atomic_volume_dev) shows that when the standard deviation of atomic size lies within an optimal range, lattice distortion may trigger solid-solution strengthening, whereas deviations beyond this range can lead to structural instability. This suggests that the Young's modulus of halide glasses may be optimized by tuning atomic size.
Fig. 19 SHAP value analysis of elastic properties using a bee swarm plot. The plot illustrates the impact of various features on the prediction of (a) Young's modulus, (b) shear modulus, and (c) Poisson's ratio.

Shear modulus prediction (Fig. 19b) reveals a more complex multi-scale coupling mechanism. The average third-level ionization energy (ionization_energy3_avg) spans a wide SHAP range, and its high-value (purple) points are biased to the right, indicating that a higher third-level ionization energy strengthens deep-electron binding and thus increases shear modulus. The average heat of formation (heat_of_formation_avg) exposes a non-traditional pattern: dense blue low values (stable compounds) dominate the high SHAP positions, so a lower heat of formation can enhance shear modulus, possibly through the quantum locking effect of strong bonding networks against dislocation. Meanwhile, the radiating point cluster of the atomic volume deviation (atomic_volume_dev) shows that high (purple) deviation values induce local lattice distortions (such as grain boundaries) that create barriers to shear deformation, while excessive dispersion produces a cluster of blue low-value points, corresponding to a risk of shear instability.

Poisson's ratio prediction (Fig. 19c) hinges on the balance between electron-cloud deformability and bond strength. The average dipole polarizability (dipole_polarizability_avg) forms a flame-like gradient in which high polarizability maps directly to high SHAP values, suggesting that electron-cloud flexibility acts as a quantum switch for lateral expansion. In contrast, the average first-level ionization energy (ionization_energy1_avg) forms an arch-like distribution: SHAP values peak in the medium bond-strength range (moderately bound outer electrons), with very high or very low ionization energies both diminishing the contribution, revealing a golden bond-strength interval for Poisson's ratio optimization.

Fig. 20 uses bee swarm plots to illustrate SHAP value analysis for electrical resistance prediction under different temperature conditions. Each subplot reveals the impact of various features on electrical resistance prediction outcomes.


Fig. 20 SHAP value analysis of electrical resistance using a bee swarm plot. The plot illustrates the impact of various features on the prediction of electrical resistance at (a) 20 °C, (b) 100 °C, and (c) 150 °C. (d) Explanation of the output for a random instance of electrical resistance at 150 °C using the SHAP method by the force plot.

In the electrical resistance prediction at 20 °C (Fig. 20a), the heat of formation deviation (heat_of_formation_dev) and average heat of formation (heat_of_formation_avg) are the two most significant features affecting the prediction of electrical resistance. The high SHAP value of heat of formation deviation indicates that local compositional changes in materials, such as elemental segregation, may increase electrical resistance by enhancing electron scattering. In contrast, the negative SHAP value of the average heat of formation suggests that stronger chemical bonds may reduce electrical resistance by limiting the free movement of electrons.

As the temperature rises to 100 °C (Fig. 20b), the most influential features for electrical resistance prediction include heat of formation deviation (heat_of_formation_dev), average heat of formation (heat_of_formation_avg), atomic volume deviation (atomic_volume_dev), and atomic weight deviation (atomic_weight_dev). The heat of formation deviation and average heat of formation remain highly important, indicating that even at moderate temperatures, the chemical bonding characteristics of materials significantly affect electrical resistance. The high SHAP value of atomic volume deviation suggests that microstructural changes in materials, such as lattice distortion, have a significant impact on electrical resistance. The positive contribution from atomic weight deviation may imply that heavier atoms could lead to denser electron scattering, thereby increasing the electrical resistance.

In the high-temperature environment of 150 °C (Fig. 20c), the features that most influence electrical resistance prediction have shifted. Although heat of formation deviation (heat_of_formation_dev) and average heat of formation (heat_of_formation_avg) remain significant, their influence is relatively reduced. In contrast, atomic volume deviation (atomic_volume_dev) and boiling point deviation (boiling_point_dev) become more important, indicating that at high temperatures, the microstructural and phase transition characteristics of materials have a more pronounced impact on electrical resistance. Specifically, an increase in atomic volume deviation may imply a higher degree of lattice distortion, potentially leading to increased electron scattering and thus higher electrical resistance. An increase in boiling point deviation might suggest a decrease in material phase stability, as materials are more likely to undergo phase transitions at high temperatures, and such structural changes directly affect the electronic transport properties of the material, subsequently influencing electrical resistance. Additionally, dipole polarizability deviation (dipole_polarizability_dev) also shows some influence, which may be related to changes in the electronic structure of materials at high temperatures. The changes brought about by the increase in temperature are mainly reflected in the physical properties of materials, such as atomic volume and boiling point, which have a more direct impact on electrical resistance at high temperatures. Some features that are important at lower temperatures, such as heat of formation, see a decrease in their influence, possibly because the chemical bonding characteristics of materials have a less significant impact on electrical resistance compared to physical properties at high temperatures.

It is worth noting that, in addition to qualitative analysis, SHAP provides detailed quantitative analysis of individual cases through force plots (Fig. 20d). Owing to space limitations, we selected one random instance at 150 °C for this analysis. The model's prediction for this sample is 7.91, built up from the contributions of the individual features. Starting from the base value in the figure, the red bars represent features such as atomic_volume_dev = 2785.2 and atomic_weight_dev = 1245.6, which have a strong positive impact on the prediction and are the main drivers pushing the predicted value well above the base value. In contrast, the blue bars, such as ionization_energy2_avg = 551.2, have a negative but comparatively minor effect. The figure thus clearly shows that the sample received a high predicted value mainly because of the notable positive contributions of the atomic volume and atomic weight deviations. Through force plots, SHAP analysis provides a powerful and intuitive tool for understanding the impact of each variable on the output.
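For completeness, the sketch below shows the general shape of such a SHAP workflow in Python, assuming a fitted tree-ensemble model. The feature names follow Fig. 19 and 20, but the data and model here are synthetic placeholders rather than those of this study.

```python
import pandas as pd
import shap
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=300, n_features=5, random_state=0)
X = pd.DataFrame(X, columns=["heat_of_formation_dev", "heat_of_formation_avg",
                             "atomic_volume_dev", "boiling_point_dev",
                             "dipole_polarizability_dev"])
model = xgb.XGBRegressor(n_estimators=200, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)          # exact SHAP values for trees
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)              # bee swarm plot (cf. Fig. 20a-c)
shap.force_plot(explainer.expected_value,      # single-instance force plot
                shap_values[0], X.iloc[0], matplotlib=True)
```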

In summary, through SHAP analysis we assessed the constructed optimal models and investigated the contribution of each input feature to the predicted properties. The results reveal which features are most influential in predicting the electrical resistance of halide glasses, providing a scientific basis for subsequent material design and optimization. Understanding the impact of these features can help researchers adjust material compositions and processing conditions in a more targeted way to achieve desired properties. Additionally, this analytical approach can guide future experimental designs, reducing unnecessary trial-and-error and increasing research efficiency.

3.4 Challenges and limitations

Despite the significant potential of machine learning in predicting the elastic properties and electrical resistance of halide glasses, its application still faces several challenges and limitations. These include data availability and quality, difficulties in feature selection, lack of model interpretability, and the need for improved algorithms.

Firstly, the availability and quality of data are fundamental limiting factors. Acquiring high-quality experimental data on halide glasses is constrained by stringent synthesis conditions and complex performance characterization, so existing datasets are limited and scattered. Future efforts should focus on establishing cross-institutional shared databases, integrating high-throughput computations, such as first-principles simulations, with standardized experimental data, and introducing data-augmentation methods such as generative adversarial networks (GANs) to fill data gaps (see the sketch after this paragraph). Specifically, the data should encompass chemical composition, thermodynamic properties, mechanical properties, and electrical properties. Data platforms could be open-access scientific databases, such as the Materials Project or the Material Measurement Laboratory (MML) at NIST, which offer standardized data formats and interfaces. Standardizing experimental data also implies defining clear experimental protocols and data-recording standards to ensure consistency and comparability. In terms of high-throughput computation, this may involve automated software pipelines or collaborative frameworks, such as high-performance computing (HPC) for large-scale parallel calculations, cloud services such as AWS to support computational needs, and version-control and collaboration platforms such as Git and GitHub to manage code and data.
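To make the GAN-based augmentation route concrete, the compact PyTorch sketch below trains a vanilla GAN on standardized descriptor vectors; the architecture sizes, learning rates, and data are illustrative assumptions only, not the configuration used in this work.

```python
import torch
import torch.nn as nn

n_feat, n_noise = 20, 8
G = nn.Sequential(nn.Linear(n_noise, 64), nn.ReLU(), nn.Linear(64, n_feat))
D = nn.Sequential(nn.Linear(n_feat, 64), nn.LeakyReLU(0.2),
                  nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(256, n_feat)   # stand-in for standardized glass records
for step in range(1000):
    # Discriminator step: distinguish real records from generated ones
    fake = G(torch.randn(real.size(0), n_noise)).detach()
    loss_d = (bce(D(real), torch.ones(real.size(0), 1)) +
              bce(D(fake), torch.zeros(real.size(0), 1)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator step: produce samples the discriminator labels as real
    fake = G(torch.randn(real.size(0), n_noise))
    loss_g = bce(D(fake), torch.ones(real.size(0), 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

augmented = G(torch.randn(100, n_noise)).detach()   # 100 synthetic records
```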

Secondly, feature engineering faces multi-scale coupling challenges. The elastic modulus and electrical resistance of halide glasses are influenced by multiple factors at different scales, including chemical composition, microstructure (such as phase separation), and defect states (like vacancy cluster concentration). Traditional feature selection methods struggle to capture key descriptors, such as polyhedral distortions of anions, while automated feature extraction heavily relies on crystallographic expertise. The breakthrough lies in developing material informatics-specific descriptor libraries and applying graph neural networks to directly process topological data of atomic structures, transitioning from “manual feature design” to “intrinsic feature mining”. In this envisioned approach, the use of structural data that includes both experimentally determined structures and those generated through simulation models would aid in more comprehensively understanding and predicting material properties. However, we also recognize the need to balance the use of experimental and simulated data to avoid issues that may arise from an over-reliance on simulated data. Ensuring the diversity and reliability of data sources is crucial for enhancing the accuracy and generalizability of the models. Moreover, knowledge graphs, such as materials knowledge graph (MKG) and material property knowledge graph (MPKG), can provide a structured representation of the relationships between these factors and material properties, helping researchers visualize multi-scale interactions more intuitively and thereby enhance feature selection and improve the overall performance of machine learning models.

Thirdly, the lack of model interpretability hinders understanding of the underlying mechanisms. Although deep learning models have achieved R2 > 0.85 in predicting Young's modulus, their "black box" nature makes the decision logic untraceable. Possible solutions involve integrating physics-informed machine learning, embedding constitutive equations as constraints within the loss function, and combining SHAP analysis with molecular dynamics simulations to interpret the atomic-scale contributions of features. By incorporating physical laws as constraints during model training, predictions can be made to fit the data well while also adhering to established physical principles, an approach that has been shown to improve the accuracy and reliability of predictions across various materials science applications. For example, in predicting the mechanical properties of polymers, models constrained by the stress-strain relationship have been shown to yield results more consistent with physical laws, improving the interpretability of the model and researchers' confidence in its predictions. Furthermore, methods such as layer-wise relevance propagation or counterfactual explanations can further enhance interpretability by detailing each feature's specific contribution to the model's decisions. These approaches help demystify the inner workings of deep learning models, allowing researchers to understand how particular predictions arise and thereby increasing trust in and acceptance of the models.
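As a minimal illustration of embedding such a physical constraint in the loss function, the sketch below penalizes Poisson's ratio predictions outside the physically admissible range for isotropic materials (0 < nu < 0.5); the penalty weight is an arbitrary illustrative choice.

```python
import torch
import torch.nn.functional as F

def physics_informed_loss(nu_pred, nu_true, lam=10.0):
    mse = F.mse_loss(nu_pred, nu_true)
    # Penalize any prediction outside the admissible interval (0, 0.5)
    violation = F.relu(nu_pred - 0.5) + F.relu(-nu_pred)
    return mse + lam * violation.mean()

nu_pred = torch.tensor([0.28, 0.55, -0.02])   # one compliant, two violating
nu_true = torch.tensor([0.27, 0.31, 0.25])
print(physics_informed_loss(nu_pred, nu_true))   # larger than the plain MSE
```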

Lastly, current algorithms lack adaptability for dynamic response modeling and struggle to capture the nonlinear response of halide glass properties to temperature changes, especially in regions sensitive to phase transitions. To tackle this challenge, future research could focus on learning frameworks that adapt to temperature variations, such as transformer-based temperature-controlled attention mechanisms. These mechanisms would dynamically adjust feature weights across different temperature zones using self-attention and multi-head attention, and could introduce multi-objective Bayesian optimization to balance conflicting performance objectives, enabling more precise prediction of material properties. Compared with traditional methods, such transformer-based models offer greater flexibility in handling the impact of temperature on material performance, providing more refined predictions and capturing more complex nonlinear relationships. Through multi-objective Bayesian optimization, they could also find the optimal balance between conflicting objectives, such as the trade-off between hardness and toughness or between transparency and mechanical strength, ultimately constructing a four-dimensional mapping network encompassing "composition-structure-temperature-performance".

In summary, although machine learning shows significant potential for predicting the elastic properties and electrical resistance of halide glasses, the challenges outlined above, spanning data, features, interpretability, and algorithms, must still be addressed. Future research should focus on building cross-institutional shared databases, developing material informatics-specific descriptor libraries, integrating physics-guided machine learning, and designing temperature-adaptive learning frameworks. Through interdisciplinary collaboration and the use of knowledge graphs, the field can transition from "manual feature design" to "intrinsic feature mining", enhancing model interpretability and adaptability. Only through an integrated "data-algorithm-physics" solution can machine learning drive the evolution of halide glass design from empirical trial-and-error to mechanism-driven development, accelerating the advancement of key materials.

4 Conclusions

This study established a predictive framework for the elastic properties and electrical resistance of halide glasses using machine learning. By collecting and preprocessing a large amount of experimental data and combining feature selection with data augmentation, a high-quality dataset was constructed, providing a solid foundation for model training. The performance of six traditional machine learning algorithms (decision tree, random forest, XGBoost, LSBoost, SVM, GKR) and four deep learning and neural network algorithms (BP, CNN, LSTM, GRNN) was comprehensively evaluated, and different algorithms were found to excel at different tasks. Specifically, random forest performed best in predicting Young's modulus (R2 = 0.96146), the support vector machine was superior for shear modulus (R2 = 0.95129), and the decision tree was outstanding for Poisson's ratio (R2 = 0.96783). For electrical resistance, LSBoost achieved the best performance at 20 °C (R2 = 0.97696), while XGBoost performed best at 100 °C and 150 °C (R2 = 0.93482 and 0.90954, respectively). In addition, the BP neural network demonstrated good adaptability in multi-task prediction, whereas LSTM, CNN, and GRNN showed limitations in some tasks. Through SHAP analysis, this study investigated the contributions of the input features to the prediction outcomes, revealing the key factors influencing the properties of halide glasses. Future research should focus on constructing cross-institutional shared databases, developing feature libraries specific to materials informatics, introducing physics-guided machine learning methods, and designing temperature-adaptive learning frameworks to overcome current limitations. These measures will promote the evolution of halide glass design from empirical trial-and-error to mechanism-driven approaches, accelerating the development of key materials.

Conflicts of interest

The authors declare that they have no conflict of interest.

Data availability

Supplementary information is available. See DOI: https://doi.org/10.1039/d5tc03267a.

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (NSFC62575145), Natural Science Foundation of Jiangsu Higher Education Institutions of China (grant number 23KJA510005).
