Open Access Article
Atthaphon Ariyarita, Attasit Wiangkhamb, Phatthawit Siripaiboonsuba, Jittiwat Nithikarnjanatharnc, Wannisa Nutkhumc and Prasert Aengchuan*d
aSchool of Mechanical Engineering, Institute of Engineering, Suranaree University of Technology, Muang, Nakhon Ratchasima 30000, Thailand
bDepartment of Industrial and Logistics Engineering, Faculty of Engineering, Srinakharinwirot University, Ongkharak, Nakhon Nayok 26120, Thailand
cDepartment of Industrial Engineering, Faculty of Engineering and Technology, Rajamangala University of Technology Isan, Muang, Nakhon Ratchasima 30000, Thailand
dSchool of Manufacturing Engineering, Institute of Engineering, Suranaree University of Technology, Muang, Nakhon Ratchasima 30000, Thailand. E-mail: prasert.a@sut.ac.th
First published on 30th October 2025
Polylactic acid (PLA) composites reinforced with spent coffee grounds (SCG) and modified with a silane coupling agent (VTMS) offer a sustainable alternative for applications requiring biodegradability and enhanced mechanical performance. This study employed a data-driven approach to optimize tensile strength and Shore D hardness by varying the contents of PLA, SCG, and silane. Seventy-five composite samples were fabricated and tested, exhibiting tensile strengths of 26.5–57.9 MPa and hardness values of 77.5–80.8 Shore D. A multi-output XGBoost regression model, trained on 60% of the data and validated on the remaining 40%, achieved strong predictive accuracy (R2 = 0.884, MSE = 12.64 for tensile strength; R2 = 0.908, MSE = 0.071 for hardness) after augmentation with 159 synthetic samples generated via jittering, Gaussian noise, and kernel density estimation. Multi-objective optimization using NSGA-II simultaneously maximized both properties, revealing Pareto-optimal compositions dominated by higher PLA and moderate SCG and silane contents. The best formulation (1490 g PLA, 121 g SCG, 20 g silane) achieved 53.33 MPa tensile strength and 80.06 Shore D hardness. The combined XGBoost-NSGA-II framework demonstrates an efficient, data-driven strategy for optimizing bio-composite performance while minimizing experimental effort.
Despite its promising environmental profile, neat PLA suffers from inherent brittleness and limited impact resistance, which restrict its applications in load-bearing or high-stress environments. One common strategy to overcome these limitations involves reinforcing PLA with natural fillers. Among them, spent coffee grounds (SCG), a readily available, low-cost agricultural waste, have attracted attention for their ability to enhance the stiffness, thermal stability and biodegradability of PLA-based composites.12–14 Moreover, the use of silane coupling agents such as vinyltrimethoxysilane (VTMS) can improve the interfacial adhesion between the hydrophilic filler and hydrophobic PLA matrix, thus promoting efficient load transfer and enhancing mechanical properties.15,16 Several experimental studies and statistical design methods, including response surface methodology (RSM), have been employed to investigate the effects of SCG content and silane treatment on the mechanical performance of PLA composites.17,18 These properties are highly sensitive to formulation parameters such as filler content, particle dispersion and surface chemistry, necessitating precise control over composition and processing.19,20
Traditional trial-and-error experimentation is often time-consuming and inefficient. Recently, artificial intelligence (AI) and machine learning (ML) have been increasingly applied in materials research to model complex relationships between composition and properties.21–25 In particular, several studies have demonstrated the capability of ML models to accurately predict the mechanical behavior of polymer composites and optimize material performance. For example, Ulkir et al. employed artificial neural networks and fuzzy logic to predict the mechanical properties of 3D-printed PLA/wood composites,26 Omigbodun et al. utilized XGBoost and AdaBoost algorithms to model and enhance the mechanical performance of PLA/cHAP scaffolds for biomedical use,27 and Crupano et al. investigated 3D-printed PLA/PHB composites to support data-driven analysis of compressive and fatigue behavior.28 Similarly, Fasikaw et al. demonstrated that AI models can successfully predict polymer composite behavior, while Lee et al. compared regression algorithms for metal forming, highlighting the superiority of tree-based methods.29,30 ML techniques such as linear regression, support vector machines, neural networks and tree-based algorithms have been applied to predict tensile strength, Shore D hardness and other critical properties. Among these, Extreme Gradient Boosting (XGBoost) has emerged as a robust and interpretable tool capable of handling nonlinear, high-dimensional data while providing feature importance metrics that offer practical insights into the influence of compositional factors such as PLA, SCG and silane.31,32
In many materials design problems, especially for multifunctional composites, multiple objectives such as tensile strength and surface Shore D hardness must be simultaneously optimized, which introduces trade-offs. Single-objective optimization approaches are insufficient in such scenarios. Consequently, multi-objective optimization algorithms like the Non-dominated Sorting Genetic Algorithm II (NSGA-II) have been widely adopted for their ability to efficiently explore large, multi-dimensional design spaces and generate a diverse set of Pareto-optimal solutions.33–36 NSGA-II was selected in this study over alternative methods such as multi-objective particle swarm optimization (MOPSO) and MOEA/D due to its strong balance between convergence speed and population diversity, as well as its proven effectiveness in discrete and high-dimensional composite optimization problems.25
In this study, we propose an integrated, data-driven framework that combines experimental testing, synthetic data generation, and multi-objective optimization. A multi-output XGBoost regression model is trained on both physical and synthetically augmented data to predict the tensile strength and Shore D hardness of PLA/SCG/Silane composites. Synthetic data are generated using techniques such as jittering, Gaussian noise injection, interpolation, and kernel density estimation (KDE), which enhance the diversity and coverage of the design space without compromising physical plausibility.
Unlike previous studies that primarily relied on linear regression or neural network models, this work leverages the XGBoost algorithm for multi-output prediction, which offers superior interpretability, fast convergence, and robustness to small and imbalanced datasets, conditions often encountered in experimental materials research. The integration of synthetic data generation with XGBoost enables effective learning from limited samples, while the feature-importance metrics provide quantitative insights into the influence of each compositional factor. The trained surrogate model is further embedded into the NSGA-II framework to efficiently identify Pareto-optimal composite formulations. This combined approach establishes a scalable, accurate, and cost-effective pathway for the rational design and optimization of high-performance, sustainable bio-composite materials.
Based on the CCD approach, 15 distinct formulations were generated and each was replicated 5 times, resulting in a total of 75 composite samples for mechanical testing. The three independent variables, the PLA, SCG and silane contents, were varied at the levels shown in Table 1.
| PLA (g) | SCG (g) | Silane (g) |
|---|---|---|
| 1409 | 359 | 60 |
| 1275 | 225 | 75 |
| 1141 | 359 | 15 |
| 1409 | 359 | 15 |
| 1141 | 91 | 60 |
| 1275 | 0 | 38 |
| 1275 | 225 | 38 |
| 1050 | 225 | 38 |
| 1500 | 225 | 38 |
| 1275 | 225 | 0 |
| 1275 | 450 | 38 |
| 1409 | 91 | 15 |
| 1409 | 91 | 60 |
| 1141 | 91 | 15 |
| 1141 | 359 | 60 |
Fig. 2 Raw materials used for composite preparation: (a) polylactic acid (PLA); (b) spent coffee grounds (SCG); and (c) vinyltrimethoxysilane (VTMS).
The blended materials were subsequently compounded using a twin-screw extruder operated at 180 °C under controlled thermal and shear conditions to ensure uniform dispersion and promote chemical interaction between the PLA matrix and the surface treated filler. The extruded material initially emerged in the form of continuous strands, as shown in Fig. 3(a), which were then cooled and mechanically pelletized into granules, as illustrated in Fig. 3(b). These pellets were subjected to an additional drying cycle under the same conditions prior to molding.
Fig. 3 Extruded bio-composite materials at different processing stages: (a) continuous strands from twin-screw extrusion; (b) pelletized granules after mechanical grinding.
Importantly, the extrusion process was performed sequentially, beginning with the formulation containing the lowest SCG content and gradually progressing to those with higher filler loadings. This order of processing was implemented to minimize filler carry-over and prevent cross-contamination between different formulations.
Fig. 4 (a) Dimensions of the test specimen for tensile testing; (b) tensile testing machine; (c) dimensions of the Shore D hardness test specimen; and (d) Shore D durometer.
Tensile test specimens were fabricated using a vertical injection molding machine, which provides high precision and consistency in specimen formation. The geometry of the tensile specimens followed the ASTM D638 Type IV standard,40,42,43 which is widely used for testing thermoplastics and composite materials of limited thickness; Fig. 4(a) presents the specimen dimensions in millimeters (mm).
For Shore D hardness testing, the specimens were prepared and measured in accordance with the ASTM D2240 standard,41–44 as illustrated in Fig. 4(c). The specimens used for this test were cubic in shape, with equal side lengths to ensure consistent contact with the Shore D hardness indenter. This method allows for consistent and reliable assessment of surface hardness in rigid plastic materials.
After molding, all specimens were conditioned at room temperature for 48 hours to relieve residual stresses and stabilize dimensional properties prior to mechanical testing.
To address these challenges, this study proposes an integrated framework that leverages artificial intelligence (AI) for property prediction and evolutionary algorithms for multi-objective optimization. Specifically, Extreme Gradient Boosting (XGBoost) was adopted as the surrogate model due to its high accuracy, robustness to overfitting and ability to handle complex nonlinear datasets.45,46
The experimental data used to train the model were generated using central composite design (CCD), a widely used statistical approach under the response surface methodology (RSM) framework. While CCD provides an efficient means of exploring factor interactions with a limited number of experimental runs, the resulting dataset often lacks sufficient coverage of the full compositional space. To mitigate this limitation and enhance the generalization capacity of the surrogate model, synthetic data were generated using statistical augmentation techniques including jittering, Gaussian noise injection, interpolation and targeted sampling of low-density regions.47,48 These synthetic samples preserved the statistical structure and physical plausibility of the original data while significantly increasing the diversity of training inputs. The overall workflow of the proposed data-driven optimization process, integrating experimental data, synthetic augmentation, XGBoost modeling, and NSGA-II optimization, is illustrated in Fig. 5.
Fig. 5 Workflow of the AI-assisted multi-objective optimization framework integrating XGBoost prediction and synthetic data augmentation.
The augmented dataset was then used to train a multi-output XGBoost model capable of predicting both tensile strength and Shore D hardness based on the composite formulation inputs: PLA, SCG and silane content. Once trained, the XGBoost model was embedded within the Non-dominated Sorting Genetic Algorithm II (NSGA-II) framework to perform multi-objective optimization. The goal was to simultaneously maximize both tensile strength and Shore D hardness, with the surrogate model guiding the search over a wide design space without requiring additional physical experiments. This integrated approach offers a scalable, data driven pathway for discovering optimal bio-composite formulations and advancing sustainable material development.
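As a hedged illustration of this modeling step, the sketch below shows one way a multi-output XGBoost surrogate could be assembled using scikit-learn's multi-output wrapper; the column names, placeholder values and hyperparameters are assumptions for illustration, not the exact configuration used in this study.

```python
# Minimal sketch of a multi-output XGBoost surrogate (illustrative only).
# Column names, placeholder values and hyperparameters are assumptions.
import pandas as pd
from xgboost import XGBRegressor
from sklearn.multioutput import MultiOutputRegressor

# Formulation inputs (g) and the two mechanical responses (placeholder values).
X = pd.DataFrame({"PLA_g": [1409, 1275, 1141], "SCG_g": [359, 225, 91], "Silane_g": [60, 75, 15]})
y = pd.DataFrame({"tensile_MPa": [40.1, 45.2, 50.3], "shoreD": [79.0, 78.5, 79.8]})

# One XGBoost model per target via the multi-output wrapper.
surrogate = MultiOutputRegressor(
    XGBRegressor(n_estimators=300, max_depth=5, learning_rate=0.05,
                 subsample=0.8, colsample_bytree=0.8, random_state=42)
)
surrogate.fit(X, y)

# Predict both properties for a candidate formulation.
candidate = pd.DataFrame({"PLA_g": [1490], "SCG_g": [121], "Silane_g": [20]})
pred_tensile, pred_hardness = surrogate.predict(candidate)[0]
print(pred_tensile, pred_hardness)
```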
Additional mathematical formulations and detailed algorithmic procedures for data augmentation, XGBoost training, and NSGA-II optimization are provided in the SI.
To enhance the predictive capability of the machine learning model, synthetic data were subsequently generated and combined with the original experimental dataset. This augmented dataset was used to train and compare the performance of XGBoost models, enabling a direct evaluation of whether the inclusion of synthetic data improves model accuracy and generalization. The performance of both models trained on original data and on the augmented dataset was assessed based on standard metrics, including R2 and MSE, for both tensile strength and Shore D hardness.
Following model evaluation, the optimized model was integrated into the Non-dominated Sorting Genetic Algorithm II (NSGA-II) to perform multi-objective optimization. The optimization aimed to simultaneously maximize tensile strength and Shore D hardness, generating a well-distributed set of Pareto-optimal solutions. To support the decision-making process, a composite performance score calculated from normalized tensile strength and Shore D hardness was used to rank the solutions and identify the most well-balanced formulations.
Finally, the top five optimized formulations with the highest composite scores are presented and discussed, providing practical guidelines for selecting compositions that achieve an optimal balance between mechanical performance and material efficiency.
The plot reveals several notable trends across the dataset. Specimens with high PLA content generally exhibited higher tensile strength, suggesting that PLA plays a dominant role in structural reinforcement. This is clearly illustrated by the green and purple lines, where higher PLA levels correspond to increased tensile strength and Shore D hardness. Conversely, when PLA content is low, as seen in the same green and purple lines, the values for both mechanical properties tend to decrease. SCG content, on the other hand, showed an inverse trend. As indicated by the blue and black lines, increasing SCG content is associated with a reduction in tensile strength, likely due to the lower stiffness and poor bonding ability of spent coffee grounds. Interestingly, these lines also reveal that higher SCG levels can slightly enhance Shore D hardness, while lower SCG levels tend to support higher tensile strength but lower Shore D hardness values. The effect of silane (VTMS) is more nuanced. The yellow and red lines show that increasing silane content tends to improve both tensile strength and Shore D hardness, particularly at moderate to high dosages. This trend suggests that silane enhances interfacial bonding between PLA and SCG, contributing to better stress transfer and mechanical performance.
Overall, the PCP plot provides insight into how compositional factors influence the performance of the composites, helping to identify input combinations that yield favorable mechanical outcomes and guiding future model development and optimization efforts.
The model's performance was assessed in terms of the coefficient of determination R2 and mean squared error (MSE) for both tensile strength and Shore D hardness. As shown in Fig. 7(a)–(d), the predicted values are closely aligned with the experimental results, with most data points distributed along the 45° reference line. For tensile strength, the model trained with synthetic data achieved a slightly higher test R2 of 0.884 and a lower MSE of 12.641, compared to 0.881 and 13.608 for the model trained on original data. A more substantial improvement was observed in the Shore D hardness prediction, where R2 increased from 0.723 to 0.908, and MSE decreased markedly from 0.431 to 0.071 after data augmentation. These results confirm that the addition of synthetic data enhanced the model's predictive accuracy and consistency, particularly for hardness prediction, which initially exhibited greater variability. Such improvements echo findings in other materials prediction studies using XGBoost and data augmentation.49
Table 2 presents the results of the 5-fold cross-validation conducted to evaluate the stability and generalization of the XGBoost model. Each fold used 80% of the data for training and 20% for validation. The synthetic-augmented model achieved a slightly higher average R2 value (0.859) compared to the model trained on original data (0.837), indicating improved predictive consistency across folds. Although the average MSE increased marginally (4.171 vs. 3.576), the overall variation among folds was reduced, suggesting that the inclusion of synthetic data enhanced model robustness and reduced overfitting. This behavior aligns with literature reporting that boosting algorithms are effective for small or imbalanced datasets when augmented training data are available.50,51
| K fold (run fold) | Average R2 macro (original data) | Average MSE macro (original data) | Average R2 macro (synthetic data) | Average MSE macro (synthetic data) |
|---|---|---|---|---|
| Run 1 (fold = 1) | 0.893 | 3.400 | 0.881 | 3.469 |
| Run 2 (fold = 2) | 0.745 | 2.751 | 0.887 | 3.347 |
| Run 3 (fold = 3) | 0.816 | 6.416 | 0.841 | 4.692 |
| Run 4 (fold = 4) | 0.839 | 2.224 | 0.793 | 4.937 |
| Run 5 (fold = 5) | 0.892 | 3.090 | 0.891 | 4.409 |
| Average | 0.837 | 3.576 | 0.859 | 4.171 |
Based on these findings, the synthetic-augmented XGBoost model was selected as the surrogate model for subsequent NSGA-II multi-objective optimization, providing a balanced trade-off between prediction accuracy and stability across different data partitions. To further optimize the mechanical performance of PLA/SCG/Silane bio-composites, the NSGA-II algorithm was employed to simultaneously maximize tensile strength and Shore D hardness. The resulting Pareto-optimal front, shown in Fig. 8, reveals a smooth trade-off between the two objectives, with optimal formulations concentrated in regions of higher tensile strength while maintaining or slightly improving hardness. The use of NSGA-II for exploring trade-offs in composite design is well established in engineering and materials optimization literature.52,53 These results indicate that increasing PLA content while maintaining moderate SCG and silane levels yields superior overall mechanical performance, offering practical guidance for designing composite formulations that meet specific application requirements.
Fig. 8 Comparison of experimental data and NSGA-II Pareto-optimal solutions for tensile strength and Shore D hardness.
To facilitate decision-making among the Pareto-optimal solutions, a set of composite score formulas was introduced to evaluate the overall performance of each candidate. These composite scores combine tensile strength and Shore D hardness into a single performance index using different weighting strategies and mathematical models. Five representative formulations obtained from the NSGA-II optimization are summarized in Table 3, showing their corresponding tensile strength, Shore D hardness, and composite scores. The composite score serves as an integrated indicator that balances both mechanical properties, allowing the identification of the most optimal formulation.
| PLA (g) | SCG (g) | Silane (g) | Tensile (MPa) | Shore D hardness | Composite Score |
|---|---|---|---|---|---|
| 1490 | 121 | 20 | 53.33 | 80.06 | 1.5282 |
| 1426 | 121 | 20 | 53.46 | 80.03 | 1.5214 |
| 1496 | 162 | 20 | 49.40 | 80.11 | 1.3703 |
| 1426 | 162 | 20 | 49.57 | 80.08 | 1.3651 |
| 1496 | 60 | 20 | 54.17 | 79.62 | 1.3613 |
The composite score was introduced as a normalized performance index to evaluate the overall mechanical quality of each formulation by simultaneously considering both tensile strength and Shore D hardness. It was calculated as the sum of the normalized values of these two properties, allowing a balanced comparison between formulations with different trade-offs. A higher composite score indicates better combined mechanical performance.
Table 3 summarizes the representative composite formulations obtained from the NSGA-II optimization process, along with their corresponding tensile strength, Shore D hardness, and composite scores. In these optimized formulations, the silane content (VTMS) was fixed at 20 g, while the PLA and SCG contents were varied to study their combined influence on mechanical behavior. A detailed mathematical definition and normalization procedure used for computing the composite score are provided in the SI for completeness and reproducibility.
The results indicate that formulations with lower SCG content tend to achieve higher composite scores and better mechanical properties. The highest composite score of 1.5282 was obtained for the formulation containing 1490 g PLA, 121 g SCG, and 20 g Silane, which exhibited a tensile strength of 53.33 MPa and a Shore D hardness of 80.06. A similar formulation with 1426 g PLA and the same SCG and silane levels followed closely with a score of 1.5214.
Conversely, increasing the SCG content to 162 g caused a decline in tensile strength to 49.40–49.57 MPa despite a slight gain in hardness, resulting in lower composite scores (1.3703–1.3651). This suggests that excessive SCG weakens the structural integrity of the composite due to its lower intrinsic strength compared with the PLA matrix.
Interestingly, the formulation achieving the highest tensile strength of 54.17 MPa and a hardness of 79.62 contained only 60 g of SCG, yielding a high composite score of 1.3613. These observations confirm that moderate filler loading enhances the overall mechanical performance of PLA-based bio-composites.
The original dataset comprised 75 experimentally derived samples generated using a central composite design (CCD) approach. Although CCD efficiently captures interactions and quadratic effects with a reduced number of experiments, its limited coverage of the input space poses challenges for training machine learning models with high generalization capability. Therefore, the dataset was randomly shuffled and partitioned into three distinct subsets: a 60% training set used for model fitting and data augmentation, a 20% validation set used for hyperparameter tuning and performance monitoring, and a 20% test set reserved for final model evaluation. Synthetic data augmentation was applied exclusively to the training subset to prevent information leakage. Four techniques were employed to generate synthetic samples:54–57
(1) Jittering eqn (1): small Gaussian noise is added to both the input and output variables to create local perturbations around the original training points:
$$\tilde{X} = X + \varepsilon_X, \qquad \tilde{Y} = Y + \varepsilon_Y \qquad (1)$$

where $\tilde{X}$ represents the synthesized input feature vector after noise injection, $X$ is the original input vector (e.g., PLA, SCG and silane content), $\varepsilon_X$ is small random noise drawn from a normal distribution, and $\tilde{Y}$, $Y$ and $\varepsilon_Y \sim N(0, \sigma^2)$ are the corresponding output vector, actual value and noise, respectively.
(2) Gaussian Sampling eqn (2): new samples are drawn from a multivariate Gaussian distribution based on the mean and covariance of the original dataset:
$$\tilde{X} \sim N(\mu_X, \Sigma_X), \qquad \tilde{Y} = Y_{\text{nearest}} + \varepsilon_Y \qquad (2)$$

where $\tilde{X}$ represents the input vector synthesized by drawing samples from a multivariate normal distribution with mean $\mu_X$ and covariance matrix $\Sigma_X$ calculated from the original dataset, $Y_{\text{nearest}}$ is the output value of the nearest neighbor in the training set and $\varepsilon_Y \sim N(0, \sigma^2)$ is small noise added to preserve variability.
(3) Interpolation eqn (3): new data points are synthesized by convex combinations of randomly selected input pairs:
$$\tilde{X} = \alpha X_i + (1 - \alpha) X_j, \qquad \tilde{Y} = \alpha Y_i + (1 - \alpha) Y_j \qquad (3)$$

where $\tilde{X}$ is the new input vector created by interpolating between two randomly selected input vectors $X_i$ and $X_j$ from the training data, $\alpha \in [0, 1]$ is a randomly chosen interpolation coefficient and $\tilde{Y}$ is the interpolated output vector from the corresponding values $Y_i$ and $Y_j$.
(4) Targeted Sampling eqn (4): low-density regions in the input space, identified by kernel density estimation (KDE), are perturbed to create new samples:
$$\tilde{X} = X_k + \delta_X, \qquad \tilde{Y} = Y_k + \delta_Y \qquad (4)$$

where $X_k$ is an input vector located in a low-density region of the training space as identified by kernel density estimation, $\delta_X \sim N(0, \tau^2)$ is the perturbation applied to the input and $\tilde{Y}$ is the corresponding synthesized output value obtained by adding small noise $\delta_Y$ to the actual target $Y_k$.
The synthetic data generated from the training set were combined with the original training data to form an augmented dataset, while the test set remained untouched for independent validation.
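The sketch below illustrates how the 60/20/20 split and the four augmentation operations of eqn (1)–(4) could be implemented with NumPy, SciPy and scikit-learn; the placeholder arrays, random seed, noise scales, synthetic-sample counts and the choice of scipy.stats.gaussian_kde are assumptions for illustration rather than the study's exact settings.

```python
# Sketch of the 60/20/20 split and the four augmentation techniques (eqn (1)-(4)).
# Placeholder arrays, seed, noise scales, sample counts and the KDE backend are assumptions.
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform([1050, 0, 0], [1500, 450, 75], size=(75, 3))   # PLA, SCG, silane (g), placeholder
y = rng.uniform([26.5, 77.5], [57.9, 80.8], size=(75, 2))      # tensile (MPa), Shore D, placeholder

# 60% training / 20% validation / 20% test; augmentation touches the training set only.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

def jitter(X, y, sigma=0.01):
    """Eqn (1): small Gaussian noise around each training point."""
    return (X + rng.normal(0, sigma * X.std(axis=0), X.shape),
            y + rng.normal(0, sigma * y.std(axis=0), y.shape))

def gaussian_sampling(X, y, n_new=20, sigma=0.01):
    """Eqn (2): draw inputs from N(mu, Sigma); reuse nearest-neighbour outputs."""
    Xn = rng.multivariate_normal(X.mean(axis=0), np.cov(X, rowvar=False), size=n_new)
    nearest = np.argmin(((Xn[:, None, :] - X[None, :, :]) ** 2).sum(-1), axis=1)
    return Xn, y[nearest] + rng.normal(0, sigma * y.std(axis=0), (n_new, y.shape[1]))

def interpolate(X, y, n_new=20):
    """Eqn (3): convex combinations of randomly selected training pairs."""
    i, j = rng.integers(len(X), size=n_new), rng.integers(len(X), size=n_new)
    a = rng.uniform(size=(n_new, 1))
    return a * X[i] + (1 - a) * X[j], a * y[i] + (1 - a) * y[j]

def targeted_sampling(X, y, n_new=20, tau=0.02):
    """Eqn (4): perturb points in low-density regions identified by KDE."""
    density = gaussian_kde(X.T)(X.T)
    k = rng.choice(np.argsort(density)[:n_new], size=n_new)
    return (X[k] + rng.normal(0, tau * X.std(axis=0), (n_new, X.shape[1])),
            y[k] + rng.normal(0, tau * y.std(axis=0), (n_new, y.shape[1])))

# Combine original and synthetic training samples into the augmented training set.
parts = [jitter(X_train, y_train), gaussian_sampling(X_train, y_train),
         interpolate(X_train, y_train), targeted_sampling(X_train, y_train)]
X_aug = np.vstack([X_train] + [p[0] for p in parts])
y_aug = np.vstack([y_train] + [p[1] for p in parts])
```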
At the beginning of the learning phase, the model initializes with a baseline prediction, P(0), which is typically the mean value of each target variable (tensile strength and Shore D hardness) in the training set, as shown in eqn (5):
$$P_i^{(0)} = \frac{1}{n}\sum_{k=1}^{n} A_k \qquad (5)$$

where $A_k$ denotes the actual value of the target variable for training sample $k$ and $n$ is the number of training samples.
Following this initialization, residuals are computed at each iteration t to quantify the error between actual and predicted values. These residuals guide the learning of subsequent trees, as shown in eqn (6):
$$r_i^{(t)} = A_i - P_i^{(t)} \qquad (6)$$
A regression tree h(t)(xi) is trained to minimize the squared error between the predicted residuals and the actual residuals from the previous iteration, as shown in eqn (7):
$$h^{(t)} = \arg\min_{h} \sum_{i=1}^{n} \left( r_i^{(t)} - h(x_i) \right)^2 \qquad (7)$$
Once the regression tree is fitted, the model undergoes optimization through the regularized objective function, which balances the prediction error and model complexity. This optimization process is guided by gradient boosting principles and is formulated in eqn (8):
$$\text{Obj}^{(t)} = \sum_{i=1}^{n} l\!\left(A_i,\, P_i^{(t)} + h^{(t)}(x_i)\right) + \Omega\!\left(h^{(t)}\right), \qquad \Omega(h) = \gamma T_h + \tfrac{1}{2}\lambda \lVert w \rVert^2 \qquad (8)$$

where $l$ is the loss function, $\Omega$ penalizes model complexity, $T_h$ is the number of leaves in the tree, $w$ is the vector of leaf weights, and $\gamma$ and $\lambda$ are regularization parameters.
After optimizing the objective function, the model is updated by incorporating the output of the newly fitted regression tree. This update adjusts the previous prediction by adding a scaled contribution from the new tree, as shown in eqn (9):
$$P_i^{(t+1)} = P_i^{(t)} + \eta \, h^{(t)}(x_i) \qquad (9)$$
This iterative learning process continues until the predefined number of boosting rounds (iterations) is reached or until convergence criteria are satisfied. The final prediction of the model after T iterations is obtained by aggregating the contributions from all individual trees, as shown in eqn (10):
$$P_i = P_i^{(0)} + \eta \sum_{t=1}^{T} h^{(t)}(x_i) \qquad (10)$$
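To make the update rule of eqn (5)–(10) concrete, the sketch below implements a bare-bones squared-error gradient boosting loop with shallow regression trees; it is a didactic simplification (no regularization term Ω, fixed depth and learning rate) rather than the full XGBoost algorithm used in the study.

```python
# Didactic sketch of eqn (5)-(10): squared-error boosting with shallow trees.
# This omits XGBoost's regularization and row/column subsampling.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_rounds=100, eta=0.05, max_depth=3):
    base = y.mean()                      # eqn (5): initial prediction P(0)
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):            # boosting iterations t = 1..T
        resid = y - pred                 # eqn (6): residuals r_i
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, resid)  # eqn (7)
        pred = pred + eta * tree.predict(X)   # eqn (9): scaled update
        trees.append(tree)
    return base, trees

def predict_boosted(base, trees, X, eta=0.05):
    # eqn (10): aggregate the base value and all scaled tree contributions
    return base + eta * sum(t.predict(X) for t in trees)
```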
According to the XGBoost learning framework, the predictive performance of the model is highly dependent on several hyperparameters that govern how the model fits the data, manages complexity and avoids overfitting. In this study, key hyperparameters were optimized using Optuna with the Tree-structured Parzen Estimator (TPE) sampler. The search space included the number of estimators (ranging from 100 to 500), maximum tree depth (3 to 8), learning rate (0.001 to 0.1 on a logarithmic scale), subsample ratio (0.6 to 1.0), column sampling ratio per tree (0.6 to 1.0), gamma (0 to 0.4) for minimum loss reduction required to make a split and minimum child weight (1 to 4), which controls the minimum sum of instance weights needed in a child node. A total of 50 optimization trials were performed, with validation set mean squared error (MSE) used as the evaluation metric. The best-performing hyperparameter configuration was then used to train the final XGBoost model on the combined training and validation sets. This systematic optimization strategy effectively enhanced model generalization and predictive robustness across both target properties.
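A sketch of this Optuna/TPE search over the stated ranges is given below; the arrays X_train, y_train, X_val and y_val are assumed to come from the split sketch earlier, and the random seed and single-target objective (tensile strength column) are illustrative assumptions.

```python
# Sketch of the Optuna/TPE search over the hyperparameter ranges listed above.
# Data arrays are assumed from the earlier split sketch; seed and single-target
# objective are assumptions for illustration.
import optuna
from optuna.samplers import TPESampler
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "max_depth": trial.suggest_int("max_depth", 3, 8),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.1, log=True),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        "gamma": trial.suggest_float("gamma", 0.0, 0.4),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 4),
    }
    model = XGBRegressor(**params, random_state=42)
    model.fit(X_train, y_train[:, 0])                               # tensile strength target
    return mean_squared_error(y_val[:, 0], model.predict(X_val))    # validation MSE

study = optuna.create_study(direction="minimize", sampler=TPESampler(seed=42))
study.optimize(objective, n_trials=50)
print(study.best_params)
```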
To assess the predictive performance of the XGBoost surrogate model constructed in this study, two widely accepted statistical metrics were employed: the coefficient of determination (R2) and the mean squared error (MSE). These regression evaluation metrics were used to quantify the model's ability to accurately estimate the mechanical properties (tensile strength and Shore D hardness) based on input features (PLA, SCG and silane content).
The coefficient of determination, denoted as R2, measures the proportion of the variance in the experimental values that is explained by the model predictions. The R2 value ranges from negative infinity to 1, where a value closer to 1 indicates a better fit.60 The mathematical formulation of R2 is shown in eqn (11):
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (11)$$

where $y_i$ is the experimental value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the mean of the experimental values and $n$ is the number of samples.
The second metric, mean squared error (MSE), evaluates the average of the squared differences between predicted and actual values, placing greater weight on larger errors.61 The formulation of MSE is given in eqn (12):
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (12)$$
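As a brief illustration, eqn (11) and (12) correspond directly to the standard scikit-learn metrics, shown here on placeholder arrays:

```python
# Eqn (11) and (12) evaluated with scikit-learn on placeholder arrays.
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

y_true = np.array([53.3, 49.4, 54.2, 50.1])   # placeholder experimental values
y_pred = np.array([52.8, 50.1, 53.6, 49.9])   # placeholder model predictions

print("R2 :", r2_score(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
```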
To effectively address this challenge, the present study employed the Non-dominated Sorting Genetic Algorithm II (NSGA-II), a widely recognized multi-objective evolutionary algorithm. NSGA-II is particularly valued for its ability to maintain solution diversity (population heterogeneity) while converging toward the Pareto-optimal front. Its robust performance and computational efficiency have led to its successful application across a range of domains, including materials engineering. In this study, NSGA-II was implemented to simultaneously maximize the tensile strength and Shore D hardness of PLA-based bio-composites. The corresponding optimization problem was defined with two objectives and three decision variables (inputs), as outlined in eqn (13) and (14).
$$\text{Maximize} \quad f_1(X_i) = \text{tensile strength}(X_i), \qquad f_2(X_i) = \text{Shore D hardness}(X_i) \qquad (13)$$

$$\text{subject to} \quad x_j^{\min} \le x_j \le x_j^{\max}, \quad j = 1, 2, 3 \qquad (14)$$

$$X_i = [x_1, x_2, x_3] \qquad (15)$$

where $X_i$ refers to the $i$-th individual in the population; it is a vector of decision variables ($x_1$ = PLA content, $x_2$ = SCG content, $x_3$ = silane content) used to predict the objective function values, and $x_j^{\min}$ and $x_j^{\max}$ denote the lower and upper bounds of each variable.
In step 2, evaluation, each individual in the population was evaluated using a pre-trained XGBoost model to predict the values of the two objective functions: tensile strength and Shore D hardness. Since the DEAP framework performs minimization by default, the predicted values were negated in the fitness function, as shown in eqn (16)
$$f_1(X_i) = -y_1, \qquad f_2(X_i) = -y_2 \qquad (16)$$
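A minimal sketch of how this negated-objective evaluation could look in DEAP is shown below; the fitness weights, individual encoding and reliance on the trained surrogate from the earlier sketch are assumptions consistent with the description above, not the authors' exact script.

```python
# Sketch of the evaluation step (eqn (16)) for a DEAP individual.
# Fitness weights and encoding are assumptions for illustration.
import numpy as np
from deap import base, creator

# Two objectives, both minimized after negation of the predictions.
creator.create("FitnessMin", base.Fitness, weights=(-1.0, -1.0))
creator.create("Individual", list, fitness=creator.FitnessMin)

def evaluate(individual, surrogate):
    """individual = [PLA_g, SCG_g, silane_g]; surrogate is the trained multi-output model."""
    x = np.asarray(individual, dtype=float).reshape(1, -1)
    tensile, hardness = surrogate.predict(x)[0]
    return -tensile, -hardness          # negate so minimization maximizes both properties
```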
In Step 3, non-dominated Sorting was applied to rank individuals based on Pareto dominance. Each solution in the population was compared pairwise to determine whether it is dominated by or dominates others, according to the following condition: a solution A dominates solution B if it is no worse in all objectives and strictly better in at least one. This classification resulted in a hierarchy of Pareto fronts, where the first front (F1) consists of non-dominated solutions and subsequent fronts contain solutions that are dominated by those in the preceding fronts. To ensure population diversity, a crowding distance metric was applied to each front. The crowding distance quantifies the density of solutions in the objective space by estimating the proximity of each individual to its neighbors. For each individual i, the crowding distance di was calculated using normalized distances across all objective functions, as defined in eqn (17):
$$d_i = \sum_{m=1}^{M} \frac{f_m^{(i+1)} - f_m^{(i-1)}}{f_m^{\max} - f_m^{\min}} \qquad (17)$$

where $f_m^{(i+1)}$ and $f_m^{(i-1)}$ are the values of objective $m$ for the two neighbors of individual $i$ after sorting the front along that objective, and $f_m^{\max}$ and $f_m^{\min}$ are the maximum and minimum values of objective $m$ in the front.
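A compact NumPy sketch of eqn (17) for a single front is given below; the convention of assigning boundary solutions an infinite distance and the array layout are assumptions.

```python
# Sketch of the crowding distance (eqn (17)) for one Pareto front.
# F is an (n_solutions, n_objectives) array of objective values.
import numpy as np

def crowding_distance(F):
    n, m = F.shape
    d = np.zeros(n)
    for k in range(m):
        order = np.argsort(F[:, k])
        f_sorted = F[order, k]
        span = f_sorted[-1] - f_sorted[0]
        d[order[0]] = d[order[-1]] = np.inf          # boundary solutions kept by convention
        if span > 0:
            d[order[1:-1]] += (f_sorted[2:] - f_sorted[:-2]) / span
    return d
```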
In Step 4, Genetic Operations, once the solutions were sorted based on Pareto dominance and evaluated for diversity using crowding distance, genetic operators such as selection, crossover and mutation were applied to generate new offspring. The selection process was performed using a binary tournament selection method, which selects the fittest individuals based on their non-dominated rank and crowding distance, ensuring a balance between convergence and diversity.
Crossover and mutation were then employed to introduce variation into the population. The crossover operation combines the genetic information of two parent individuals to produce new offspring, while mutation introduces small random perturbations to a single individual. These operations allow the algorithm to explore new regions of the solution space and potentially discover better-performing combinations than those in the previous generation. The two genetic operations can be expressed as follows in eqn (18) and (19):
$$\tilde{x}_j = \begin{cases} x_j^{P_1}, & \text{if } u_j < 0.5 \\ x_j^{P_2}, & \text{otherwise} \end{cases} \qquad (18)$$

$$\tilde{x}_j = \begin{cases} U\!\left(x_j^{\min}, x_j^{\max}\right), & \text{if } u_j < p_m \\ x_j, & \text{otherwise} \end{cases} \qquad (19)$$

where $\tilde{x}_j$ is the $j$-th gene of the resulting offspring, $x_j^{P_1}$ and $x_j^{P_2}$ are the corresponding genes of the two parents, $u_j$ is a uniform random number in $[0, 1]$, $U(x_j^{\min}, x_j^{\max})$ denotes an integer drawn uniformly within the variable bounds and $p_m$ is the per-gene mutation probability. The crossover was implemented using uniform crossover with a probability of 0.5 and mutation was performed using uniform integer mutation with a per-gene probability of 0.2, within the predefined bounds of each variable.
In Step 5, after generating the offspring population through crossover and mutation, the algorithm proceeded to the survival selection phase. In this step, the parent and offspring populations were merged to form a combined population of double the original size. Non-dominated sorting was then reapplied to this combined population to reclassify all individuals into updated Pareto fronts. Selection was performed based on two primary criteria: (1) Pareto rank, where individuals in lower-ranked fronts are preferred and (2) crowding distance, which prioritizes individuals located in sparsely populated regions of the objective space to preserve solution diversity. The best N individuals, where N is the original population size, were selected from the top-ranked fronts to form the population for the next generation. This elitist selection strategy ensures that the most competitive solutions are retained, while also allowing new and diverse candidates to contribute to the ongoing evolutionary process.
In step 6, the optimization process continued iteratively through multiple generations, with each cycle involving evaluation, non-dominated sorting, variation and selection. The process was terminated once a predefined stopping criterion was reached. In this study, the termination condition was set to a fixed number of 50 generations. At the conclusion of the optimization, the final population represented a diverse set of Pareto-optimal solutions, illustrating the trade-off frontier between the two mechanical objectives: tensile strength and Shore D hardness. To further analyze the results, the predicted tensile strength and Shore D hardness values were normalized using min–max scaling and combined to compute a composite performance score, facilitating the identification of well-balanced formulations. The resulting solutions were visualized as a Pareto front, enabling clear decision-making based on performance trade-offs.
In conclusion, the NSGA-II algorithm was effectively implemented to address the multi-objective optimization problem inherent in the design of PLA/SCG/Silane bio-composites. The optimization process was configured with a population size of 50 and run for 50 generations, providing a practical balance between search space exploration and computational efficiency. The genetic operators were set as follows. Uniform crossover was applied with a probability of 0.5. Uniform integer mutation was configured with a global mutation probability of 0.4 and a per-gene mutation probability of 0.2. These operators were selected to maintain sufficient genetic diversity across the population and to explore the search space thoroughly. Selection and survival strategies were based on non-dominated sorting and crowding distance, ensuring both convergence toward the Pareto front and the preservation of solution diversity. Importantly, the evaluation of each candidate solution was performed using a pre-trained XGBoost surrogate model, which enabled rapid approximation of mechanical performance without the need for additional physical testing. Together, these parameter settings and algorithmic choices enabled the NSGA-II framework to efficiently identify a wide range of optimal formulations that balance the dual objectives of tensile strength and Shore D hardness.
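The configuration described above could be assembled in DEAP roughly as follows. This is a hedged sketch that reuses creator.Individual and evaluate from the earlier sketch; the variable bounds are taken from the design ranges in Table 1, and the overall crossover probability cxpb is an additional assumption not reported in the text.

```python
# Sketch of the NSGA-II loop with the settings described above (population 50,
# 50 generations, uniform crossover indpb=0.5, uniform-integer mutation indpb=0.2,
# global mutation probability 0.4). Bounds and cxpb are assumptions for illustration.
import random
from deap import base, creator, tools

LOW, UP = [1050, 0, 0], [1500, 450, 75]          # PLA, SCG, silane bounds (g), from Table 1

toolbox = base.Toolbox()
toolbox.register("individual", lambda: creator.Individual(
    [random.randint(lo, up) for lo, up in zip(LOW, UP)]))
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("mate", tools.cxUniform, indpb=0.5)
toolbox.register("mutate", tools.mutUniformInt, low=LOW, up=UP, indpb=0.2)
toolbox.register("select", tools.selNSGA2)

def run_nsga2(surrogate, pop_size=50, n_gen=50, cxpb=0.9, mutpb=0.4):
    pop = toolbox.population(n=pop_size)
    for ind in pop:
        ind.fitness.values = evaluate(ind, surrogate)
    for _ in range(n_gen):
        offspring = [toolbox.clone(ind) for ind in pop]
        random.shuffle(offspring)
        for c1, c2 in zip(offspring[::2], offspring[1::2]):
            if random.random() < cxpb:               # cxpb is an assumption
                toolbox.mate(c1, c2)
                del c1.fitness.values, c2.fitness.values
        for ind in offspring:
            if random.random() < mutpb:
                toolbox.mutate(ind)
                del ind.fitness.values
        for ind in offspring:
            if not ind.fitness.valid:
                ind.fitness.values = evaluate(ind, surrogate)
        pop = toolbox.select(pop + offspring, k=pop_size)   # elitist survival by rank + crowding
    return tools.sortNondominated(pop, len(pop), first_front_only=True)[0]
```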
The resulting Pareto front serves as a valuable design tool for the future development of bio-composites with tailored mechanical properties. Such a framework can accelerate material formulation decisions in applications requiring trade-offs between strength and durability, such as structural packaging or biodegradable consumer products.
Each property (tensile strength and Shore D hardness) was normalized using the min–max method as shown in eqn (20):
$$X_{\text{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \qquad (20)$$

where $X$ is the predicted value of a given property (tensile strength or Shore D hardness) and $X_{\min}$ and $X_{\max}$ are its minimum and maximum values over the set of solutions considered.
The composite score was then calculated as the sum of the two normalized properties according to eqn (21):
$$\text{CS} = X_{\text{TS,norm}} + X_{\text{HD,norm}} \qquad (21)$$
A higher CS value indicates a better combined mechanical performance, reflecting the formulation that simultaneously achieves high tensile strength and high hardness. This normalization procedure ensures that both mechanical properties contribute equally and objectively to the evaluation, avoiding any unit bias or dominance of one property over the other.
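A short sketch of eqn (20) and (21) applied to an array of Pareto-optimal predictions is given below; the rows are taken from Table 3 for illustration, so scores computed over this subset will differ from the tabulated values, which normalize over the full Pareto set.

```python
# Eqn (20)-(21): min-max normalization and composite score for Pareto solutions.
import numpy as np

# Each row: [tensile strength (MPa), Shore D hardness]; values taken from Table 3.
pareto = np.array([[53.33, 80.06], [53.46, 80.03], [49.40, 80.11], [54.17, 79.62]])

norm = (pareto - pareto.min(axis=0)) / (pareto.max(axis=0) - pareto.min(axis=0))  # eqn (20)
composite_score = norm.sum(axis=1)                                                # eqn (21)

best = pareto[np.argmax(composite_score)]
print("Highest-scoring solution (tensile MPa, Shore D):", best)
```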
By augmenting the original dataset with 159 synthetic samples, the XGBoost model achieved improved predictive accuracy, with R2 values of 0.884 and 0.908 and MSEs of 12.64 and 0.071 for tensile strength and hardness, respectively. The enhanced surrogate model, integrated within the NSGA-II algorithm, effectively explored the design space and produced well-distributed Pareto-optimal solutions. The optimal formulation, comprising 1490 g PLA, 121 g SCG, and 20 g silane, yielded a tensile strength of 53.33 MPa and a Shore D hardness of 80.06, representing the best balance between strength and hardness.
Overall, the proposed XGBoost-NSGA-II framework offers a scalable and computationally efficient pathway for data-driven bio-composite design, reducing experimental effort and material waste while delivering superior mechanical performance. Future work should focus on extending this framework to incorporate additional bio-fillers, real-scale manufacturing validation, and deep learning-based modeling to capture more complex material interactions and cost-performance trade-offs for industrial application. It should be noted that the proposed optimal formulation was derived from the model's prediction, and its experimental validation remains a subject for future investigation to further confirm the reliability and applicability of the developed optimization framework.
Supplementary information: data, tables and figures. See DOI: https://doi.org/10.1039/d5ra06825h.
This journal is © The Royal Society of Chemistry 2025