Data-driven approach for the prediction of mechanical properties of carbon fiber reinforced composites †

Fiber-reinforced composite materials are integral to aerospace, automotive, and military industries. In manufacturing, these composites are subjected to certain curing cycles, which are known to have a significant impact on the mechanical properties of the material. Many studies have focused on predicting these mechanical properties of composites, but environmental conditions and curing cycles are often not considered. In this work, supervised machine learning techniques are applied to experimental data obtained from the National Center for Advanced Materials Performance (NCAMP) for various unidirectional carbon fiber laminates to predict the mechanical properties of composite materials. These techniques holistically consider the effects of environmental conditions and curing cycles, factors frequently overlooked in analytical approaches. Results show that recurrent neural network models can accurately predict the modulus of these materials, achieving R² values up to 0.98. This work establishes a statistical framework to analyze complex empirical data for advanced materials design.


Introduction
[13][14] Carbon fiber reinforced polymers (CFRPs) are of particular interest in aerospace applications: unlike metals, CFRPs do not corrode and are less susceptible to fatigue cracking. Additionally, carbon fiber yields significant weight reduction when compared to other load-bearing materials. There exist many factors in the design process that affect the strength of the resulting CFRP. A common manufacturing method involves a layup process in which individual laminae are stacked at varying angles in a repeating pattern; selecting a certain pattern can exploit desirable properties of both the polymer and fiber in multiple directions. The laminae are subject to a set of curing conditions, which include temperature, humidity, and cycle time, and these factors can also affect the strength of the resulting composite. Post curing, the surrounding environmental conditions also have a significant effect on material performance; these conditions are often studied to understand the long-term behavior of the material in varying temperatures and humidity.
The effects of curing conditions, environmental temperature, lamina orientation, and environmental moisture on CFRPs have been studied extensively. Various studies demonstrate that strength development is dependent on curing time and temperature, though no quantitative relationship has been established between these variables [15]. In structural applications, bond joint strength has been shown to decrease with increased curing temperature, although ultimate strain increased with high-temperature curing; additionally, curing CFRPs at 120 °C may not improve their strength at room temperature, but it can significantly increase their ultimate tensile strength at an environmental temperature of 50 °C [16,17]. Some environmental studies have demonstrated that short-term exposure to extremely low temperature (−28 °C) does not have significant effects on the flexural behavior of CFRPs, and others have shown increased flexural strength and energy absorption at lower (−60 °C, −100 °C) temperatures [18,19], which is advantageous for high-altitude aviation applications. Extreme high-temperature exposure (177 °C) yields considerable reductions in the ultimate tensile strength of fiber-epoxy composites. A fatigue-focused study has shown increased fatigue strength with increased temperature (125 °C), and different layup patterns have resulted in different failure modes in transverse and longitudinal directions; tensile and compressive strength have been maximized along the direction of fiber orientation [20,21]. [23][24] Ultimately, the following general conclusions can be drawn from the extensive literature at hand: polymer composite strength decreases with increasing curing temperature, increasing environmental temperature, and increasing ambient humidity, and its strength is maximized along the direction of fiber orientation.
Given these trends, researchers have developed increasingly accurate models over decades to predict the properties of CFRPs. Early models, including the rule of mixtures [25], the Reuss model [26], and the Halpin-Tsai equations [27], all provide relatively accurate estimates of ultimate strength and Young's modulus. While these models provide useful measures, they often make assumptions about the material that are difficult to achieve in practice, such as homogeneity and perfect adhesion. In recent years, computational approaches have increased in popularity for their robustness and ability to incorporate several factors into predictions. [37][38][39] In this paper, supervised ML techniques are applied to empirical data gathered from thousands of physical material samples to predict the modulus of elasticity of novel CFRPs in various directions. The results and methods presented in this paper differ greatly from the existing literature due to the real-world holistic dataset and the efficient computational models, offering significant novel avenues for exploration. Beyond the overhead incurred in training, predicting the properties of a composite using the ML models presented in this paper only requires knowledge of a small set of variables representing the material constituents, their ordering, curing and environmental conditions, and the direction of loading. The models explored here are relatively computationally inexpensive and can be implemented to incorporate a large subset of potential materials. Moreover, this work explicitly demonstrates the potential for regressive ML techniques in the composite design space.
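For reference, the three classical micromechanics models named above can be written in a few lines. The following is a minimal sketch in their standard textbook forms; the constituent moduli, the fiber volume fraction, and the Halpin-Tsai reinforcement factor `xi = 2` are illustrative assumptions and are not taken from the NCAMP data.

```python
def rule_of_mixtures(e_fiber, e_matrix, v_fiber):
    """Voigt (iso-strain) estimate, typically used for the longitudinal modulus."""
    return v_fiber * e_fiber + (1.0 - v_fiber) * e_matrix


def reuss_model(e_fiber, e_matrix, v_fiber):
    """Reuss (iso-stress) estimate, typically used for the transverse modulus."""
    return 1.0 / (v_fiber / e_fiber + (1.0 - v_fiber) / e_matrix)


def halpin_tsai(e_fiber, e_matrix, v_fiber, xi=2.0):
    """Halpin-Tsai estimate; xi is an empirical reinforcement factor (assumed here)."""
    ratio = e_fiber / e_matrix
    eta = (ratio - 1.0) / (ratio + xi)
    return e_matrix * (1.0 + xi * eta * v_fiber) / (1.0 - eta * v_fiber)


# Illustrative constituent values in GPa (hypothetical, not NCAMP measurements).
E_F, E_M, V_F = 230.0, 3.5, 0.6

e1 = rule_of_mixtures(E_F, E_M, V_F)
e2_reuss = reuss_model(E_F, E_M, V_F)
e2_ht = halpin_tsai(E_F, E_M, V_F)

print(f"E1 (rule of mixtures): {e1:.1f} GPa")
print(f"E2 (Reuss):            {e2_reuss:.2f} GPa")
print(f"E2 (Halpin-Tsai):      {e2_ht:.2f} GPa")
```

None of these closed-form estimates takes cure cycle or test environment as an input, which is the gap the data-driven models in this paper aim to address.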

Data collection
The following data has been collected by the National Center for Advanced Materials Performance (NCAMP), a research program funded by the Federal Aviation Administration that collaborates with industry manufacturers to qualify advanced aerospace materials. [41][42][43][44][45] Solvay MTM45-1 and Hexcel 8552 are both toughened epoxy resins. Note that the four selected materials are unique combinations of the two fibers and two resins; this selection is exploited and discussed in the following section. Properties of the material such as fiber type, resin type, curing conditions, environmental conditions, and fiber orientation are used as inputs for the models later presented.
The composite material data of interest are publicly available on the NCAMP website [46]. Just over 6,500 data points are obtained across all four materials, and each point is described by one of two fibers, one of two resins, one of five environmental conditions, one of seven curing conditions, and one of seven general mechanical tests. Each point is also described by the orientation of the fibers in each layer of the laminate. Fig. 1b provides a simplified visualization of some of the features that go into manufacturing and testing a carbon fiber specimen. The various design parameters lead to hundreds of thousands of potential combinations, each of which corresponds to a unique material. It is important to note that although fiber orientation is excluded from the figure to aid visualization, it is not disregarded in the model. Including the orientation in the dataset increases the number of possible combinations to the millions.
Of these numerous combinations within the data set, each requires multiple test specimens to obtain an accurate measure of the mechanical properties in varying directions. This orientation-dependent behavior is demonstrated in Fig. 1a, which shows the relationship between modulus and strength. Clusters formed by the mechanical tests reinforce the significance of composites' direction-dependent properties. Data on specimens that have not been tested on both strength and modulus are excluded in this figure. To reduce data loss and enforce consistency, the modulus is chosen as the sole output for all models. The abbreviations for the variables used throughout this paper are defined in Table 1. Each environmental condition is specified by a temperature and humidity condition; CTD, for example, corresponds to −65 ± 5 °F in dry conditions. Note that 'wet' indicates equilibrium moisture content. Each mechanical test follows an ASTM standard; LT, for example, follows ASTM D3039. Each cure cycle follows an NCAMP Process Specification (NPS); M, for example, follows NPS 81228 [42][43][44][45].

Statistical analysis
Statistical analyses have been conducted to compare the mean and variance of material mechanical properties between the test environments. A common statistical experiment for such a study is an ANOVA, which assesses the equality of means and assumes an approximately normal distribution of data and equal variances across test groups. Prior to conducting this test, Levene's test of homogeneity of variances has been conducted to assess the equality of the variances across environments for a given mechanical property. For material properties that have satisfied Levene's test, an ANOVA has been conducted. For material properties that have failed Levene's test, Welch's ANOVA, which does not rely upon an equal variance assumption, has been conducted as an alternative to an ANOVA.
ANOVA and Welch's ANOVA are omnibus tests; they do not indicate which groups differ the most from the others, but instead show whether there exists a statistically significant difference between the means of test groups. Thus, for material properties that have yielded statistically significant results from the omnibus test, a post hoc test has been conducted to determine the test group (namely, the environment) that differs most from the others. For material properties that have satisfied Levene's test, Tukey's honestly significant difference (HSD) test has been conducted; for material properties that have not satisfied Levene's test, Tamhane's T2 multiple comparison test has been conducted as an alternative post hoc test. All tests have been conducted at a significance level of 0.01.
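The two statistics that drive this decision flow can be sketched with the standard library alone. The block below is an illustrative implementation under stated assumptions, not the authors' code: the median-centered variant of Levene's test is assumed (the text does not say which centering was used), the function names are hypothetical, and the modulus values are invented. Each function returns the F-type statistic with its degrees of freedom; turning these into p-values requires the F distribution (for instance `scipy.stats.f.sf`).

```python
from statistics import mean, median, variance


def levene_statistic(groups, center=median):
    """Levene's W and its (df1, df2). center=median is an assumed choice."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviations of each observation from its group center.
    z = [[abs(y - center(g)) for y in g] for g in groups]
    z_means = [mean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    w = (n_total - k) / (k - 1) * between / within
    return w, (k - 1, n_total - k)


def welch_anova_statistic(groups):
    """Welch's F and its (df1, df2); does not assume equal variances."""
    k = len(groups)
    weights = [len(g) / variance(g) for g in groups]  # n_i / s_i^2
    w_sum = sum(weights)
    means = [mean(g) for g in groups]
    grand = sum(w * m for w, m in zip(weights, means)) / w_sum
    numerator = sum(w * (m - grand) ** 2 for w, m in zip(weights, means)) / (k - 1)
    lam = sum((1 - w / w_sum) ** 2 / (len(g) - 1) for w, g in zip(weights, groups))
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam
    return numerator / denominator, (k - 1, (k ** 2 - 1) / (3 * lam))


# Hypothetical modulus values (GPa) for one property in three environments.
ctd = [9.0, 10.0, 11.0]
rtd = [8.0, 9.0, 10.0]
etw = [6.0, 7.0, 8.0]

w_stat, w_df = levene_statistic([ctd, rtd, etw])
f_stat, f_df = welch_anova_statistic([ctd, rtd, etw])
print("Levene W:", w_stat, "df:", w_df)
print("Welch F:", f_stat, "df:", f_df)
```

In this toy example the three groups have identical spread, so Levene's statistic is zero and the equal-variance branch (ANOVA followed by Tukey's HSD) would apply; Welch's statistic is computed only to show the alternative path.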

Model development
The NCAMP data provide several attributes for each mechanical test and material. To consolidate the dataset and ensure uniformity, a certain subset of features has been selected as model inputs; these include resin, fiber, cure cycle, environment, mechanical test, and laminate orientation. Most of these inputs are one-hot encoded variables; from the one-hot encoded data matrix, the correlation matrix has been generated to gain a better understanding of the relationships between these variables. These relationships are summarized in Fig. 2. More detailed correlation heatmaps are available in the ESI † (Fig. S1 and S2). As shown, modulus values correlate most strongly with mechanical tests, closely followed by laminate orientations. There also exist small correlations between modulus and resin, as well as between modulus and environment. Thus, these correlations justify the chosen inputs and model types.
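As a small illustration of this step, the sketch below one-hot encodes two categorical inputs and computes the Pearson correlation of each indicator column with modulus. The four records are invented for illustration, the helper names are hypothetical, and "TT" is used here as a placeholder label for a transverse test that may not match the abbreviations in Table 1.

```python
from math import sqrt


def one_hot(records, field):
    """Return {"field=level": 0/1 indicator column} for one categorical field."""
    levels = sorted({r[field] for r in records})
    return {
        f"{field}={level}": [1.0 if r[field] == level else 0.0 for r in records]
        for level in levels
    }


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical specimens: modulus in GPa, values invented for illustration.
records = [
    {"test": "LT", "environment": "CTD", "modulus": 160.0},
    {"test": "LT", "environment": "RTD", "modulus": 158.0},
    {"test": "TT", "environment": "CTD", "modulus": 10.0},
    {"test": "TT", "environment": "RTD", "modulus": 9.0},
]

modulus = [r["modulus"] for r in records]
columns = {**one_hot(records, "test"), **one_hot(records, "environment")}
correlations = {name: pearson(col, modulus) for name, col in columns.items()}

for name, value in correlations.items():
    print(f"{name:>18}: {value:+.3f}")
```

Even on this toy data the pattern described in the text appears: the loading direction dominates the correlation with modulus, while the environment contributes a small but nonzero effect.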
This work studies the use of four ML models: ridge regression, random forests, a multi-layer perceptron (MLP) neural network, and a bimodal recurrent neural network. All inputs to the ridge regression, random forest, and neural network models are one-hot encoded discrete variables. For the bimodal model, all inputs are one-hot encoded except for laminate orientation, which is represented as an array of normalized values. Each entry in the array corresponds to the fiber angle orientation of that layer of the laminate. The arrays are of length 24, which is the maximum number of layers any sample in the dataset has; these arrays are front padded with a −1 for materials that have fewer than 24 layers to create uniform arrays for the LSTM branch of the bimodal model. The sole output of all models is the modulus.
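The laminate encoding described above can be sketched as follows. The array length of 24 and the front padding with −1 come from the text; the normalization is not specified in the paper, so mapping angles from [−90°, 90°] onto [0, 1] is an assumption made here (chosen so that no real angle collides with the padding value), and the function name is hypothetical.

```python
def encode_layup(angles_deg, max_layers=24, pad_value=-1.0):
    """Normalize ply angles and front-pad to a fixed length for the LSTM branch.

    ASSUMPTION: angles in [-90, 90] degrees are mapped linearly to [0, 1].
    The paper only states that the values are normalized.
    """
    if len(angles_deg) > max_layers:
        raise ValueError("laminate has more layers than max_layers")
    normalized = [(angle + 90.0) / 180.0 for angle in angles_deg]
    padding = [pad_value] * (max_layers - len(normalized))
    return padding + normalized  # padding goes in front of the real plies


# A hypothetical four-ply layup.
encoded = encode_layup([0, 45, -45, 90])
print(len(encoded), encoded[-4:])
```

Front padding keeps the real plies at the end of the sequence, so the last LSTM steps always process actual layers rather than filler.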
Linear regression is generally used as a benchmark to compare the performance of other ML models. Due to the nature of our data splits, a regularization term is desired to desensitize trained models to predict similar but unseen material properties without overfitting. Least absolute shrinkage and selection operator (LASSO) is not desirable as it could disregard critical features, so ridge regression is chosen as a suitable alternative. Due to the discrete nature of our dataset, decision trees are potentially a simple and efficient solution; however, they often have high variance and low bias on training data. Random forest models accommodate this bias and variance tradeoff by building hundreds of decision trees and taking the weighted average across all of them, so random forest models are utilized in this work. Ridge regression and random forest models have all been trained, tested and compared using the Python module scikit-learn [47]. Artificial neural networks are powerful functional approximators that have shown tremendous success in several applications and are thus also studied in this work. Finally, the bimodal neural network consists of a standard neural network and a recurrent neural network. Recurrent neural networks (RNNs) are a type of artificial neural network that have shown significant success in temporal settings, including text prediction, speech recognition, and translation, where individual units, such as letters or words, must be interpreted in the context of those preceding and succeeding them. Long short-term memory (LSTM) is an RNN-based architecture that overcomes some of the disadvantages of traditional RNN implementations [48].
Given the success of LSTM models in other ordering-dependent applications [49,50], LSTMs are used to model the fiber orientations of each layer in a laminate. The laminate orientation array is passed into the LSTM component. The output from the LSTM branch is concatenated with the one-hot encoded variable input and passed into an MLP network. All neural network models have been created, trained and tested using the TensorFlow Keras functional API [51]. The neural network model contains two dense layers with batch normalizations and dropout layers. A similar construction is used for the bimodal model, with the exception that the laminate orientation first passes through an LSTM layer, as depicted in Fig. 3.
With the collected data, two different data splits have been applied. Given a set of four similar materials, the first split leaves three materials for training and the last for testing, following an approximate 3 : 1 train-test split. For example, AS4 MTM45-1 would be kept for testing, and IM7 MTM45-1, IM7 Hexcel 8552, and AS4 Hexcel 8552 would be trained on. The second split leaves two dissimilar materials with unique fiber-resin combinations for training and the remaining two for testing, following an approximate 2 : 2 train-test split. For example, AS4 Hexcel 8552 and IM7 MTM45-1 would be kept for testing, and IM7 Hexcel 8552 and AS4 MTM45-1 would be trained on. This data split is different from a traditional 70 : 30 train-test split, as an entire material is left out for testing as opposed to a randomly selected set of data. For each split, the material that is left out shares similar features with the training data but in different combinations. This situation simulates the prediction of properties for an unseen but similar material for potential new designs. Additionally, it is important to note that several high-precision tests are available for each material combination in the dataset. Thus, a randomly shuffled train-test split would not assess the ability of the model to predict new designs, as any testing point would likely have several nearly identical representatives in the training dataset.
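The material-level splits described above can be enumerated directly. The sketch below is illustrative (the function names are hypothetical): it builds the four fiber-resin combinations, generates every leave-one-material-out split, and generates the 2 : 2 splits in which the two training materials share neither a fiber nor a resin.

```python
from itertools import product

FIBERS = ("AS4", "IM7")
RESINS = ("MTM45-1", "Hexcel 8552")
MATERIALS = list(product(FIBERS, RESINS))  # four (fiber, resin) combinations


def splits_3_to_1(materials):
    """Hold out one whole material for testing; train on the other three."""
    return [
        ([m for m in materials if m != held_out], [held_out])
        for held_out in materials
    ]


def splits_2_to_2(materials):
    """Train on two materials that share no fiber and no resin; test on the rest."""
    splits = []
    for first in materials:
        # The only material differing from `first` in both fiber and resin.
        partner = next(m for m in materials if m[0] != first[0] and m[1] != first[1])
        test = sorted([first, partner])
        train = sorted(m for m in materials if m not in test)
        if (train, test) not in splits:
            splits.append((train, test))
    return splits


three_one = splits_3_to_1(MATERIALS)
two_two = splits_2_to_2(MATERIALS)

for train, test in three_one + two_two:
    print("train:", train, "| test:", test)
```

This yields four 3 : 1 splits and two 2 : 2 splits, matching the six data splits reported in the model results.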
With the data encoded and split, all models and their respective parameters are cross-validated with 3 folds on the training data using grid search across the hyperparameters. The regularization strength parameter has been searched in the range from 0.5 to 6.5 with increments of 0.25. The random forest depth parameter and the number of estimators have been searched in the ranges from 11 to 14 and 300 to 600, respectively. Regarding the deep learning models, both the standard neural network and bimodal model have the same architectures except for the initial LSTM layer that differentiates the bimodal model. Each of the deep learning models consists of two repeating blocks of dense layers with leaky ReLU [52] activation functions, dropout [53], and batch normalization layers. The number of nodes in each layer and the Adam [54] optimizer learning rate have been searched in the ranges from 80 to 150 and 0.0001 to 0.01, respectively. The dropout layer value has been searched across the values of 0.05, 0.1, and 0.15. Lastly, the bimodal model has an additional LSTM layer as the input; for this model, the number of LSTM units has been searched in the range from 1 to 3. Both the standard neural net and the bimodal model have been trained for a maximum of 50 epochs with early stopping criteria set with a patience of 3. The final set of hyperparameters that have been used are provided in the ESI.†
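To make the size of the search concrete, the grids can be enumerated as below. The ranges are taken from the text; the step of 100 estimators, the three sampled node counts, and the three sampled learning rates are assumptions, since the paper gives only the end points. The dictionary keys `max_depth` and `n_estimators` follow scikit-learn's random forest parameter names; the remaining keys are hypothetical labels.

```python
from itertools import product


def inclusive_range(start, stop, step):
    """Evenly spaced values from start to stop, both ends included."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 10) for i in range(count)]


# Ridge regularization strength: 0.5 to 6.5 in increments of 0.25 (from the text).
ridge_alphas = inclusive_range(0.5, 6.5, 0.25)

# Random forest: depth 11 to 14 and 300 to 600 estimators (ranges from the text).
# ASSUMPTION: integer depths and a step of 100 estimators.
rf_grid = [
    {"max_depth": depth, "n_estimators": n}
    for depth, n in product(range(11, 15), range(300, 601, 100))
]

# Neural networks: end points from the text, sampled points are ASSUMPTIONS.
NODES = (80, 115, 150)
LEARNING_RATES = (1e-4, 1e-3, 1e-2)
DROPOUTS = (0.05, 0.1, 0.15)  # these three values are stated in the text
LSTM_UNITS = (1, 2, 3)        # searched for the bimodal model only

mlp_grid = [
    {"nodes": n, "learning_rate": lr, "dropout": d}
    for n, lr, d in product(NODES, LEARNING_RATES, DROPOUTS)
]
bimodal_grid = [dict(cfg, lstm_units=u) for cfg, u in product(mlp_grid, LSTM_UNITS)]

print(len(ridge_alphas), len(rf_grid), len(mlp_grid), len(bimodal_grid))
```

With 3-fold cross-validation, each configuration is fitted three times, so the number of fits per model is three times the grid size under these assumptions.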

Statistical analysis results
After conducting statistical analyses, the following results are obtained. For most mechanical tests and materials, Levene's test fails to reject the null hypothesis of equal variance. This indicates that generally, testing environmental conditions do not have a statistically significant effect on the variance of mechanical properties; all of the environments provide similarly consistent results. However, for most mechanical tests and materials, the omnibus test rejects the null hypothesis of equal means. This indicates that generally, testing environmental conditions have a statistically significant effect on the mean values of mechanical properties; certain environments yield stronger materials and others yield weaker ones.
The results of Levene's test can be attributed to the high-quality nature of the data. These samples have been developed and tested in highly controlled environments, and thus minimal variation in modulus values is expected across all environments. The results of the omnibus tests are explained by the results obtained from correlation analysis. As shown in Fig. 2, there is a small correlation between modulus and environmental conditions. Causally, this is a result of the fact that temperature and humidity affect the elastic modulus of the resin, and thus play a significant role in its macroscopic properties. It is also worth noting that Levene's test is more often failed for properties of materials manufactured with the MTM45-1 resin than it is for properties of materials manufactured with the Hexcel 8552 resin. This suggests that Hexcel 8552 demonstrates more consistent behavior across temperatures, whereas MTM45-1 is more likely to vary.
For materials manufactured with Hexcel 8552 as the constituent resin, a statistically significant difference is observed between ETW, RTD, and CTD environments; additionally, average modulus values decrease with temperature for all properties except tensile and compressive longitudinal modulus. For materials manufactured with MTM45-1 as the constituent resin, a statistically significant difference is generally observed between ETW2, ETW, RTD, and CTD environments; again, average modulus values decrease with temperature for all properties except longitudinal and unnotched properties. An example of this trend is shown in Fig. 4.
The relationship between environmental conditions and modulus is explained by the inverse relationships between humidity and modulus and between temperature and modulus. CTD, a cold, dry environment, yields high modulus values; RTD, a room temperature, dry environment, yields lower modulus values; ETW, a warm, wet environment, yields the lowest modulus values. Note, however, that the overall decreased modulus is attributed to the resin, not the fiber; this explains the observation that average modulus values decrease with temperature for all properties except tensile and compressive longitudinal modulus. Longitudinal modulus is a fiber-dominated property, whereas the modulus in the remaining directions is a matrix-dominated property. Thus, a decrease in modulus is observed in all directions except for longitudinal as temperature increases.

Model results
After the models are trained, cross-validated, and tested, the following results are obtained. Fig. 5 visualizes the training and testing results across all models using the 3 : 1 data split with IM7 MTM45-1 as the material left for testing. Additional visualizations for other splits are available in the ESI † (Fig. S3-S6). One critical observation is the striation of results across models within individual mechanical test groups, which can be attributed to the categorical nature of the input data. In turn, providing any of the models with multiple inputs that share the exact same set of manufacturing conditions will yield the same output. This contrasts with the nature of the actual experimental data, where similar manufacturing conditions will yield slightly different results due to material imperfections, testing inconsistencies, and other stochastic influences. Additionally, there are bands within each of the test groups, each of which corresponds to a different environmental condition. This indicates that environments are either overly weighted in the model, or that the other variables are not weighted heavily enough to dissipate the banding behavior.
When discussing the influence of each input on the results of the model, it is important to interpret these results with the context of the correlation analysis presented in Fig. 2. The mechanical test variable, which dictates the direction and type of loading, displays the greatest correlation with modulus, followed closely by fiber orientation in certain directions; the environmental condition variable displays a low correlation with modulus. Mechanically, this is expected, as the elastic modulus of the material is highly dependent upon the alignment between fiber orientation and the loading direction, whereas environmental conditions contribute minimal, yet non-negligible, perturbations to the final modulus value. Given that the mechanical test and fiber orientation variables correlate with modulus far more than the environmental condition variable, it is then of considerable interest that the aforementioned banding behavior occurs along the lines of environmental condition. This behavior is attributed to the fact that while nearly all the materials are tested in all the studied environmental conditions, most materials are only constructed using a limited set of fiber orientation profiles, and the fiber orientation variable is thus interpreted by the models as an indicator of the material type.
Four different models are trained and tested on six data splits, totaling twenty-four models. The average R² and RMSE values across all models and data splits are 0.992 and 4.222 GPa, respectively, for the training data. The average R² and RMSE values are 0.934 and 12.103 GPa, respectively, for the testing data. For the 3 : 1 data split, among the four models, the bimodal model yields the most accurate results, with average testing R² and RMSE values of 0.953 and 10.293 GPa, respectively, whereas the standard neural network yields the least accurate results, with average R² and RMSE values of 0.925 and 12.940 GPa, respectively. For the 2 : 2 data split, among the four models, the bimodal model yields the most accurate results, with average testing R² and RMSE values of 0.949 and 10.919 GPa, respectively, whereas the standard neural network yields the least accurate results, with average R² and RMSE values of 0.907 and 14.737 GPa, respectively. As shown in Table 2, on average, the bimodal model outperforms the other models across all data splits; however, for certain splits, the random forest outperforms the bimodal model. The standard neural network also outperforms the bimodal model for one data split. The results of the bimodal model across various data splits are visualized in Fig. 6. The results of the bimodal model and the standard neural network visualized side-by-side are available in the ESI † (Fig. S7).
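For clarity, the two reported metrics can be computed as in the following sketch, which uses the standard definitions of the coefficient of determination and the root mean squared error. The true and predicted modulus values are invented for illustration.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the modulus (GPa)."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical true and predicted modulus values in GPa.
y_true = [10.0, 20.0, 30.0]
y_pred = [10.0, 20.0, 36.0]

example_rmse = rmse(y_true, y_pred)
example_r2 = r_squared(y_true, y_pred)
print(f"RMSE = {example_rmse:.3f} GPa, R2 = {example_r2:.3f}")
```

Because the modulus spans a wide range across mechanical tests, a high R² can coexist with an RMSE of several gigapascals, which is why both metrics are reported.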
All models are capable of predicting the modulus of unseen materials considering the relatively minute difference between R² and RMSE values between testing and training datasets. Moreover, the models perform well across both types of train-test splits, implying that even with sparse training data, the models continue to demonstrate their generalizability. With a higher R² coefficient and lower RMSE on average, the bimodal model outperforms the rest, likely because of the continuous nature of the laminate orientation input. This suggests that the LSTM successfully interprets various features of the laminate input, including the number of layers, the fiber orientation of each layer, and most importantly, the ordering of these layers, properties that were not represented by the one-hot encoded input provided to other models. However, it is important to note that this model did not significantly outperform the others, suggesting that the less computationally expensive models perform comparably well.
In the future, the benefits of machine learning in composite design can be extended to the prediction of other material properties, such as strength and Poisson's ratio. Additionally, other inputs to the model can be represented in other ways to increase generalizability. Cure cycles can also be modeled as sequential variables of temperature and pressure, and thus can be similarly provided as an input to an additional LSTM branch. Finally, in this work, data is obtained for a limited number of materials; the NCAMP database encompasses dozens of other CFRPs, and other manufacturers may provide similar information. As opposed to artificially randomized computer-generated data, experimental data describes the inherent
© 2022 The Author(s). Published by the Royal Society of Chemistry. Mater. Adv., 2022, 3, 7319-7327
Fig. 5 Training and testing results for all models. Modulus predicted by model is plotted against true modulus value across all models. For these data, the IM7 Hexcel 8552 material has been left for testing.

Conclusions
The following conclusions can be drawn from the results of this work. Firstly, from the statistical explorations, there is no statistically significant difference between the variances of the moduli across environments, but there is a statistically significant difference between their means, implying that the increased modulus in certain environments can be exploited without increasing its variability. For fiber-dominated properties, namely those in the longitudinal direction, warmer environmental conditions yield higher modulus values, whereas for matrix-dominated properties, cooler environmental conditions yield higher modulus values. The results of the ML models yield the following conclusions. Firstly, all the models yield highly accurate results, demonstrating significant potential for regressive analysis of CFRPs using ML techniques; nevertheless, these models have room for improvement, especially regarding distribution prediction. Secondly, the highly accurate results across both unconventional data splits speak strongly to the models' ability to predict unseen material combinations, which has several implications for accelerating material design. Finally, these results demonstrate the success of recurrent neural network architectures in modeling the features and ordering of lamina in laminate composites. Ultimately, advancing these models will reduce the number of experiments that need to be conducted to achieve desired properties, paving the way for a faster analysis of new laminate composite designs.

Fig. 1 Exploratory data analysis. (a) Maximum strain scatter plots reveal direction-dependent properties of carbon fiber composites. Box plots visualize the distributions of the properties of each mechanical test group in one dimension. Note that the IPS strength data is unavailable, so it is not shown. (b) Sankey diagram illustrates the count of different combinations of carbon fiber specimens that have been collected. It is important to note that the true design space is significantly larger than what is displayed in the figure, as fiber orientation has been removed from the figure for simplicity and size. Mechanical test variations have also been grouped for visualization purposes.

Fig. 2 Correlation between input variables and modulus. The correlation plot reveals the correlation between each input variable and modulus. Significantly small correlations have been filtered out.

Fig. 3 Bimodal recurrent neural network model. y_i represents the fiber orientation angle of the ith layer of the material. The variables x_1 to x_5 represent the five variables listed in the light grey box. The circles connected by lines represent a standard MLP network. The top branch depicts how the fiber orientation is input to the LSTM, and the middle branch depicts how the remaining one-hot encoded variables are input to the first MLP network. The outputs of these separate branches are then concatenated and passed into a second MLP network to produce a property value.

Table 1 Definitions of acronyms in NCAMP dataset

Table 2 R² and RMSE values for modulus across all models and data splits. RMSE values are computed across all mechanical tests and are given in gigapascals. The best testing metrics for each data split are shown in bold. Columns: RF train, RF test, Ridge train, Ridge test, NN train, NN test, RNN train, RNN test.
stochasticity of the though mean values are predicted in this work, real-world data can be exploited to predict a more useful distribution of values instead.