Open Access Article
Kangyong Ma*
Department of Chemistry and Chemical Engineering, College of Ecology, Lishui University, Lishui, 323000, China. E-mail: kangyongma@outlook.com
First published on 4th April 2024
As a novel type of oil–water separation material, thermoplastic polyurethane (TPU) porous material exhibits many excellent properties such as low density, high specific surface area, and outstanding oil–water separation performance. However, the performance of TPU porous materials is often impeded by various factors, and conducting numerous experiments to investigate the relationship between these factors and the adsorption performance can be both expensive and time-consuming. As an alternative to these experiments, machine learning (ML) techniques can be used to estimate experimental results. Therefore, in this study, we developed an ensemble hybrid model to predict the adsorption performance of these materials and thereby replace some experiments. We also constructed XGBoost (XGB), Decision Tree Regressor (DT), K-Neighbors Regressor (KNN), Bagging Regression (BGR), and Extra Trees Regression (ETR) single models to predict material properties, all of which exhibited high prediction accuracy. On this basis, SHAP values were employed to explain how single factors and combinations of factors influence the properties of such materials.
Thermoplastic polyurethane (TPU) porous material is a new type of oil–water separation material, which has received widespread attention in the field of oil–water separation due to its low density, high porosity, large specific surface area, three-dimensional interconnected pore structure, hydrophobicity and lipophilicity.7,25 Despite the wide application of TPUs, their performance in practical applications is often affected by a variety of factors such as preparation conditions, pollutant types and environmental conditions.4 Thermoplastic polyurethane (TPU) comprises two key elements: the hard segment, which is obtained by the reaction of isocyanates and diols, and imparts toughness and strength; and the soft segment that provides flexibility and resilience through the reaction of either polyesters or polyethers.4,6
There is a considerable body of literature on thermoplastic polyurethanes and porous materials. For example, Qin et al.7 investigated the hydrophobic behavior of layered porous TPU prepared by thermally induced phase separation at different solution concentrations. Ye et al.8 investigated TPU adsorption under different pH values (1–14), temperatures (0–90 °C) and flow conditions and evaluated it quantitatively. Wang et al.9 studied the effect of different pollutant species on the adsorption capacity of TPU porous materials prepared using a simple thermally induced phase separation method. While these studies have made progress in understanding the properties of TPU porous materials, the methods used are often empirical in nature, relying on a trial-and-error approach that is expensive, time-consuming, and environmentally polluting. It is therefore necessary to seek alternative approaches and to study the adsorption mechanism further in order to better understand the relationship between the adsorption performance of TPU porous materials and the factors that influence it.
Recently, machine learning techniques have garnered significant attention for their exceptional data analysis abilities, and advanced machine learning approaches are increasingly used to build predictive models.10,11 Pruksawan et al.11 reported the use of machine learning to design and develop bespoke, highly functional materials from small sample datasets in materials science. Yan et al.10 demonstrated the potential of machine learning by successfully predicting corrosion rates with statistical analysis and machine learning algorithms. That study used a low-alloy steel marine atmospheric corrosion database to examine the impact of alloying elements and environmental factors on the corrosion behavior of low-alloy steels. These studies highlight the advantages of machine learning for correlation analysis, multivariate fitting, simulation, and data visualization. Therefore, we developed an ensemble hybrid machine learning model to predict the adsorption performance of TPU porous materials. It combines three base learners, K-Neighbors Regressor (KNN), Bagging Regression (BGR) and Extra Trees Regression (ETR), with an XGBoost (XGB) model and a neural network model. Although more complex than a single prediction model, it is also more reliable. Integrating machine learning into research on TPU porous materials is expected to overcome the limitations of traditional experimental methods and to reduce experimental costs and environmental pollution.
This study combines Shapley value interpretation with machine learning algorithms to construct a prediction model for the adsorption capacity of thermoplastic polyurethane porous materials using experimental data as input. The effect of different preparation conditions on the adsorption properties of TPU porous materials was also investigated through SHapley Additive exPlanations. This work not only demonstrates the potential of machine learning algorithms in predicting material properties and data mining, but also provides new ideas for further research on this type of materials. Fig. 1 shows the experimental procedure of this study.
:1, v/v) by heating to 80 °C and magnetic stirring for 90 minutes, resulting in a homogeneous TPU suspension. This suspension was then transferred into a glass tube (15 mm in diameter) and subjected to a preliminary phase separation by placement in an ice-water bath at 0 °C for 30 minutes, followed by transfer to a −20 °C environment for a complete phase separation over the course of 12 hours. The resulting TPU porous material was obtained via freeze-drying at −80 °C and 5 Pa for 48 hours.12,13
In addition, by varying the initial concentration (4%, 6%, 8% and 10%), phase separation time (15, 30 min), phase separation temperature (0, 4 °C) and mixing ratio (8.5:1.5, 9:1 and 9.5:0.5), a series of TPU porous materials was prepared under different experimental conditions, and their adsorption properties were tested according to the method in Section 2.3.
In this study, an ensemble hybrid model built on single machine learning prediction models was developed to improve the stability and accuracy of the prediction results. The model combines base learners (KNN, bagging and extra trees), an XGBoost model and an LSTM (Long Short-Term Memory) model. The base learners are first trained on the original dataset and then used to make predictions on the test set. Their predictions serve as the new training and test sets. The XGBoost model is trained on the new training set, and grid search is used to further optimize its parameters and improve its performance. Another KNN model is trained on the combined predictions of the base learners, and the LSTM neural network is introduced to compensate for the shortcomings of the traditional machine learning algorithms. Finally, the predictions of the XGBoost, KNN and LSTM models are combined as a weighted average.

LSTM is a particular type of recurrent neural network (RNN) and a powerful tool that mitigates the long-term memory problem and the vanishing gradient problem, both of which are difficult issues in RNNs.18,19 It has become an effective and scalable model for solving several learning problems related to sequential data. The core idea of LSTM is to replace the summation unit in the hidden layer with a memory cell that can maintain its state over time, together with nonlinear gating units that regulate the flow of information into and out of the cell.19 Specifically, the LSTM model consists of the following four main components:
Input gate: it decides how much of the input to the network at the current moment is saved to the cell state.
Forget gate: it decides how much of the cell state from the previous moment is preserved for the current moment.
Cell state: it is the memory part of the LSTM model, responsible for storing long-term information for use in subsequent time steps.
Output gate: it controls how much of the current cell state is passed to the current output value.
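The four components listed above can be made concrete with a single LSTM time step. The following is a minimal NumPy sketch, not the network used in this study: the weight layout (`W`, `U` and `b` stacked in the order input, forget, candidate, output), the layer sizes and the random inputs are all assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(x):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """Advance an LSTM cell by one time step.

    W has shape (4*H, D), U has shape (4*H, H) and b has shape (4*H,).
    The four blocks are stacked as: input gate, forget gate,
    candidate values, output gate (an assumed layout).
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate: how much new input to store
    f = sigmoid(z[H:2 * H])        # forget gate: how much old state to keep
    g = np.tanh(z[2 * H:3 * H])    # candidate values for the cell state
    o = sigmoid(z[3 * H:4 * H])    # output gate: how much state to expose
    c_t = f * c_prev + i * g       # updated cell state (long-term memory)
    h_t = o * np.tanh(c_t)         # hidden state / output at this step
    return h_t, c_t


# Run a short random sequence through the cell (illustrative sizes only).
rng = np.random.default_rng(0)
D, H = 4, 3
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):
    h, c = lstm_step(x_t, h, c, W, U, b)
```

Setting the forget-gate bias strongly positive and the input-gate bias strongly negative makes the cell copy its previous state unchanged, which is the mechanism that lets an LSTM retain information over many steps.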
In addition, five single machine learning models, namely Extreme Gradient Boosting (XGBoost), Decision Tree Regression (DT), K-Neighbors Regression (KNN), Bagging Regression (BGR), and Extra Trees Regression (ETR), were used in this study to predict the adsorption capacity of the thermoplastic polyurethane porous material. The parameters of these models were determined by grid search, and the selected values are shown in Table 1.
| Models | Model parameters |
|---|---|
| XGBoost | n_estimators = 60, learning_rate = 0.04, max_depth = 6, min_child_weight = 2 |
| Decision tree | max_depth = 5, random_state = 50, min_samples_split = 2 |
| K-neighbors | algorithm = 'ball_tree', leaf_size = 2, n_neighbors = 3 |
| Bagging | max_features = 10, max_samples = 40, n_estimators = 100 |
| Extra tree | max_depth = 12, min_samples_leaf = 2, min_samples_split = 7 |
| Ensemble hybrid | Epochs = 500, number of neurons = 128, dropout = 0.1, batch size = 32 |
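As an illustration of how parameters such as those in Table 1 can be selected by grid search, the sketch below tunes a K-neighbors regressor with scikit-learn's `GridSearchCV`. The synthetic data, the candidate values in `param_grid`, the five-fold cross-validation and the scoring metric are assumptions made for the example; the original search space is not given in the text.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the experimental dataset (four input features).
rng = np.random.default_rng(0)
X = rng.uniform(size=(60, 4))
y = 10 - 6 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.3, size=60)

# Hypothetical candidate values; only the Table 1 entries are from the source.
param_grid = {
    "n_neighbors": [3, 5, 7],
    "leaf_size": [2, 10, 30],
    "algorithm": ["ball_tree"],
}

# Every parameter combination is scored by 5-fold cross-validation.
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```

The same pattern applies to the other single models by swapping the estimator and its parameter grid.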
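$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \tag{1}$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{2}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \tag{3}$$
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \tag{4}$$
where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the mean of the observed values and $n$ is the number of samples.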
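The four accuracy metrics reported in Table 2 (R2, MAE, RMSE and MAPE) can be computed in a few lines. The sketch below uses their standard definitions, with MAPE expressed as a fraction rather than a percentage (an assumption based on the magnitudes in Table 2); the function name and the sample values are hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return R2, MAE, RMSE and MAPE (as a fraction) for two 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "MAPE": np.mean(np.abs(err / y_true)),       # requires nonzero y_true
    }


# Made-up observed and predicted values, for illustration only.
metrics = regression_metrics([2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.5])
print(metrics)
```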
The statistical analysis and data mining tasks were carried out in Python using the scikit-learn library.
In this study, a Pearson correlation matrix was used to examine the correlation of each feature with the other features, and the results are shown in Fig. 3. A positive value indicates a positive correlation, while a negative value indicates a negative correlation. Compared with the other features, concentration shows a significant negative correlation with adsorption. Although the Pearson correlation matrix gives full information on the correlation between every pair of attributes, the discussion here focuses mainly on the effect of concentration on the adsorption amount. Selecting input features based on the Pearson coefficient is a common technique in machine learning. Since there is no obvious linear relationship between the input features and no correlation coefficient between them exceeds 0.8, all input features are retained. The data are standardized to prevent inaccurate model calculations.
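The screening and standardization steps described above can be sketched with NumPy. The data below are synthetic: the feature levels mirror the preparation conditions in the text, but the adsorption values, the numeric encoding of the mixing ratio and the linear relationship used to generate them are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# Synthetic preparation conditions drawn from the levels used in the text.
concentration = rng.choice([4.0, 6.0, 8.0, 10.0], size=n)   # %
sep_time = rng.choice([15.0, 30.0], size=n)                  # min
temperature = rng.choice([0.0, 4.0], size=n)                 # degrees C
ratio = rng.choice([8.5, 9.0, 9.5], size=n)                  # assumed encoding

# Hypothetical target: adsorption falls as concentration rises.
adsorption = 20.0 - 1.2 * concentration + rng.normal(scale=0.5, size=n)

# Pearson correlation matrix over the four features plus the target.
data = np.column_stack([concentration, sep_time, temperature, ratio, adsorption])
corr = np.corrcoef(data, rowvar=False)

# Flag input-feature pairs whose |r| exceeds the 0.8 threshold.
feature_corr = corr[:4, :4]
upper = np.triu_indices(4, k=1)
too_correlated = np.abs(feature_corr[upper]) > 0.8

# Standardize the input features to zero mean and unit variance.
features = data[:, :4]
standardized = (features - features.mean(axis=0)) / features.std(axis=0)
```

If `too_correlated` contains no `True` entries, no input feature is redundant by this criterion and all of them are kept.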
| Models | R2 | MAE | RMSE | MAPE |
|---|---|---|---|---|
| XGBoost | 0.8889 | 0.85 | 1.08 | 0.171 |
| Ensemble hybrid | 0.9403 | 0.66 | 0.79 | 0.176 |
| Decision tree | 0.8251 | 0.68 | 0.89 | 0.206 |
| K-neighbors | 0.7062 | 0.88 | 1.16 | 0.262 |
| Bagging | 0.8113 | 0.70 | 0.93 | 0.188 |
| Extra tree | 0.7961 | 0.52 | 0.97 | 0.139 |
These results indicate that single machine learning models may suffer from overfitting or underfitting issues, and are sensitive to noise and outliers, while ensemble hybrid models can better handle these issues. Ensemble hybrid models can combine the strengths of multiple models, improve model accuracy and stability, and usually have stronger predictive capabilities. Therefore, when dealing with more complex datasets, using ensemble hybrid models may be more suitable than single machine learning models.
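A simplified version of such an ensemble can be sketched with scikit-learn alone. This is not the model developed in this study: `GradientBoostingRegressor` stands in for XGBoost, the LSTM branch is omitted, and the synthetic data, the model settings and the equal `weights` are assumptions. The sketch only shows the two-stage idea of feeding base-learner predictions to second-stage models and averaging their outputs.

```python
import numpy as np
from sklearn.ensemble import (
    BaggingRegressor,
    ExtraTreesRegressor,
    GradientBoostingRegressor,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data with four input features (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(size=(120, 4))
y = 10 - 6 * X[:, 0] + 3 * X[:, 0] * X[:, 1] + rng.normal(scale=0.3, size=120)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Stage 1: train the base learners on the original features.
base_learners = [
    KNeighborsRegressor(n_neighbors=3),
    BaggingRegressor(n_estimators=100, random_state=0),
    ExtraTreesRegressor(n_estimators=100, min_samples_leaf=2, random_state=0),
]
for learner in base_learners:
    learner.fit(X_tr, y_tr)

# Their predictions become the new training and test sets.
Z_tr = np.column_stack([learner.predict(X_tr) for learner in base_learners])
Z_te = np.column_stack([learner.predict(X_te) for learner in base_learners])

# Stage 2: a boosting model (stand-in for XGBoost) and a second KNN model.
meta_boost = GradientBoostingRegressor(
    n_estimators=60, learning_rate=0.04, max_depth=3, random_state=0
).fit(Z_tr, y_tr)
meta_knn = KNeighborsRegressor(n_neighbors=3).fit(Z_tr, y_tr)

# Final prediction: weighted average of the second-stage models.
weights = np.array([0.5, 0.5])  # hypothetical weights
y_hat = weights[0] * meta_boost.predict(Z_te) + weights[1] * meta_knn.predict(Z_te)
rmse = float(np.sqrt(np.mean((y_hat - y_te) ** 2)))
```

A common variation is to build the stage-2 training set from out-of-fold predictions, which reduces the risk that the second-stage models learn from overfitted base-learner outputs.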
Fig. 5 shows the regression plots for all models during the training and prediction phases. The x-axis in each plot represents the observed values, and the y-axis represents the values predicted by the models. The red line in each plot represents perfect prediction, where the observed and predicted values are identical. The other radial lines mark prediction errors of 15% and 30% relative to the red line. If all data points lie on the red line, whose equation is y = x, the model predicts the actual values without any error. The plots show that the ensemble hybrid model not only has the highest R2 value but also has the fitted line closest to y = x, indicating excellent predictive performance. In addition, the ensemble model can better exploit the advantages of the different models and reduce the overall prediction error: its RMSE (0.79) is the lowest of all models and its MAE (0.66) is the second lowest. The other models also show good predictive performance but are inferior to the ensemble hybrid model overall.
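The 15% and 30% error lines of such a regression plot can also be summarized numerically as the share of points that fall inside each band. The sketch below assumes the bands are defined by the relative error with respect to the observed value; the function name and the sample values are hypothetical.

```python
import numpy as np


def fraction_within(y_true, y_pred, tolerance):
    """Share of predictions whose relative error is at most `tolerance`."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    relative_error = np.abs(y_pred - y_true) / np.abs(y_true)
    return float(np.mean(relative_error <= tolerance))


# Made-up observed and predicted values, for illustration only.
y_obs = np.array([5.0, 8.0, 10.0, 12.0])
y_fit = np.array([5.5, 8.4, 12.5, 9.0])

within_15 = fraction_within(y_obs, y_fit, 0.15)
within_30 = fraction_within(y_obs, y_fit, 0.30)
```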
In addition, Fig. 5 shows the performance of the models on both the training and test sets. Because there is no significant difference between the performance on the two datasets, the models established in this study do not exhibit obvious overfitting. To mitigate the overfitting risk that arises from the complex structure of the ensemble hybrid model, we introduced dropout and regularization during its development. Dropout randomly ignores a portion of the neurons during training, which prevents the model from relying too heavily on specific neurons and thus improves its generalization ability. Regularization adds penalty terms to the loss function, which makes the model favor smaller weight values, thereby reducing model complexity and lowering the risk of overfitting. The results show that these measures can effectively avoid overfitting of complex models on small sample datasets.
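Both measures are simple to express in code. The sketch below shows inverted dropout and an L2-penalized mean squared error in NumPy. It is only an illustration of the two ideas: the text does not specify which regularization term was used, so the L2 form and the value of `lam` are assumptions, while the dropout rate of 0.1 follows Table 1.

```python
import numpy as np


def dropout(activations, rate, rng):
    """Inverted dropout: zero a random share of units and rescale the rest."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)


def l2_regularized_loss(y_true, y_pred, weights, lam):
    """Mean squared error plus an L2 penalty on the weights."""
    mse = np.mean((y_true - y_pred) ** 2)
    return mse + lam * np.sum(weights ** 2)


rng = np.random.default_rng(0)
dropped = dropout(np.ones(10000), rate=0.1, rng=rng)

# With a perfect fit, the loss equals the penalty term alone.
loss = l2_regularized_loss(
    np.array([1.0, 2.0]), np.array([1.0, 2.0]), np.array([3.0, 4.0]), lam=0.01
)
```

Rescaling by `1 - rate` keeps the expected activation unchanged, so no correction is needed at prediction time when dropout is switched off.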
Furthermore, this study also explores the contribution of preparation conditions to the model using SHAP value analysis to identify important features, which provides new ideas for exploring how to improve the performance of TPU porous materials.
In this plot, each data point represents a sample, and each feature has a bar graph representing the distribution of its SHAP values. The color of the bar represents the value of the feature in the sample: the darker the color, the higher the value, and the lighter the color, the lower the value. The position of the bar indicates the degree of influence of the feature on the model output. A shift to the left means that the feature has a negative impact on the model output, and a shift to the right means that it has a positive impact. This graph helps researchers quickly identify which features are most important for the model output, so that they can perform feature selection or optimize model performance.22
Fig. 6 shows the SHAP feature importance map, in which the features are listed in descending order of importance: concentration, mixing ratio, time and temperature. This ranking helps to identify which features have the greatest impact on the model's predictions, and it can guide feature selection, parameter tuning, model optimization and further exploration of the relationship between these features and the target variable.
The feature importance plot provides a visual representation of the importance of each feature in predicting the target variable. However, to evaluate the features comprehensively, both the importance of a feature and the direction of its impact on the prediction should be considered. The SHAP summary plot integrates these two aspects to provide a more comprehensive view. The y-axis of the plot lists the features, while the x-axis represents the corresponding SHAP values. The points in the plot are color-coded by feature value, with low values shown in blue and high values in red. Points to the right of the zero line indicate a positive effect on the adsorption capacity, while those to the left indicate a negative effect. Our results show that the initial concentration and the mixing ratio have a significant effect on the adsorption capacity, while the effects of the other two preparation conditions, phase separation time and phase separation temperature, are relatively weak.
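To make the meaning of a SHAP value concrete, the sketch below computes exact Shapley values for one sample by enumerating all feature coalitions, which is feasible with only four features. It does not use the `shap` package or the models of this study: the linear toy model `toy_predict`, the background samples and the numeric encoding of the mixing ratio are assumptions, and absent features are filled in from the background data.

```python
import itertools
import math

import numpy as np

FEATURES = ["concentration", "mixing ratio", "time", "temperature"]


def coalition_value(predict, x, background, subset):
    """Mean prediction when only the features in `subset` are fixed to x."""
    samples = background.copy()
    idx = list(subset)
    if idx:
        samples[:, idx] = x[idx]
    return float(predict(samples).mean())


def shapley_values(predict, x, background):
    """Exact Shapley values by brute-force enumeration of coalitions."""
    m = x.shape[0]
    phi = np.zeros(m)
    for j in range(m):
        others = [k for k in range(m) if k != j]
        for size in range(m):
            weight = (
                math.factorial(size) * math.factorial(m - size - 1)
                / math.factorial(m)
            )
            for subset in itertools.combinations(others, size):
                gain = coalition_value(predict, x, background, subset + (j,))
                gain -= coalition_value(predict, x, background, subset)
                phi[j] += weight * gain
    return phi


def toy_predict(X):
    """Hypothetical model: adsorption falls with concentration."""
    return 20.0 - 1.5 * X[:, 0] + 2.0 * X[:, 1]


background = np.array([
    [4.0, 8.5, 15.0, 0.0],
    [6.0, 9.0, 30.0, 4.0],
    [8.0, 9.5, 15.0, 4.0],
    [10.0, 9.0, 30.0, 0.0],
])
x = np.array([10.0, 9.5, 30.0, 4.0])
phi = shapley_values(toy_predict, x, background)
for name, value in zip(FEATURES, phi):
    print(f"{name}: {value:+.2f}")
```

The values sum to the difference between the prediction for `x` and the mean prediction over the background, which is the additivity property that gives SHAP its name. In this toy case the high concentration receives a negative value, matching the direction of the effect described in the text.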
Furthermore, the initial concentration has a negative impact on the adsorption capacity, with a larger initial concentration resulting in a greater negative effect, while a lower concentration has a positive effect. However, the adsorption capacity of such materials is often influenced by multiple factors simultaneously.
Fig. 7 explains the effect of a single characteristic variable on the adsorption capacity, but not the effect of multivariate combinations on the adsorption capacity. Interaction plots can reflect the importance of individual features and feature combinations, and rank them to determine the importance of feature combinations. Therefore, we use the SHAP interaction diagram for further investigation. The concentration-mixing-ratio feature combination in Fig. 9 is the most important feature combination. TPU porous materials have a hierarchical porous structure as described in the literature25 and shown in Fig. 8. In addition, according to the literature, concentration has a significant effect on the microporous skeleton structure of the material.25 The mixing ratio plays a crucial role in determining the number of micropores in the microporous skeleton.
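The pairwise quantity behind a SHAP interaction plot can be computed by brute force in the same spirit. The sketch below evaluates the SHAP interaction value for a pair of features using the Shapley interaction index. The product-form toy model, the background samples and the numeric encoding of the mixing ratio are assumptions for illustration and do not represent the fitted models.

```python
import itertools
import math

import numpy as np


def coalition_value(predict, x, background, subset):
    """Mean prediction when only the features in `subset` are fixed to x."""
    samples = background.copy()
    idx = list(subset)
    if idx:
        samples[:, idx] = x[idx]
    return float(predict(samples).mean())


def shap_interaction(predict, x, background, i, j):
    """SHAP interaction value for the feature pair (i, j), with i != j."""
    m = x.shape[0]
    others = [k for k in range(m) if k not in (i, j)]
    total = 0.0
    for size in range(m - 1):
        weight = (
            math.factorial(size) * math.factorial(m - size - 2)
            / (2 * math.factorial(m - 1))
        )
        for subset in itertools.combinations(others, size):
            delta = (
                coalition_value(predict, x, background, subset + (i, j))
                - coalition_value(predict, x, background, subset + (i,))
                - coalition_value(predict, x, background, subset + (j,))
                + coalition_value(predict, x, background, subset)
            )
            total += weight * delta
    return total


def toy_predict(X):
    """Hypothetical model with a concentration x mixing-ratio interaction."""
    return X[:, 0] * X[:, 1]


background = np.array([
    [4.0, 8.5, 15.0, 0.0],
    [6.0, 9.0, 30.0, 4.0],
    [8.0, 9.5, 15.0, 4.0],
    [10.0, 9.0, 30.0, 0.0],
])
x = np.array([10.0, 9.5, 30.0, 4.0])

conc_ratio = shap_interaction(toy_predict, x, background, 0, 1)
conc_time = shap_interaction(toy_predict, x, background, 0, 2)
```

Here only the concentration and mixing-ratio pair receives a nonzero interaction value, because the toy model contains no other interaction term. Ranking such values by magnitude is what identifies the most important feature combination.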
Although a low concentration increases the porosity of the material, the adsorption capacity also decreases when the mixing ratio (1,4-dioxane : deionized water) is too small, because too small a mixing ratio increases the number of surface micropores. When the number of micropores is too large, the skeleton of the material collapses and the adsorption capacity decreases. As shown in Fig. 9, concentrations and mixing ratios that are either too high or too low can negatively affect the adsorption capacity. In summary, the adsorption performance of TPU porous materials is affected by the synergistic effect of various factors, and the order of importance is shown in Fig. 9. SHAP value interpretation illustrates the complex functional relationship between variable combinations in a more intuitive way, laying the foundation for further development of such materials.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4ra00010b
This journal is © The Royal Society of Chemistry 2024