Prediction and visual analysis of flue-cured tobacco aroma types based on machine learning and feature derivation
Abstract
This study investigated the relationship between aroma types and chemical properties of flue-cured tobacco (FCT) and explored the applicability of machine learning (ML) combined with feature derivation in the FCT industry. A total of 619 Sichuan FCT samples representing three aroma types (fresh-sweet, honey-sweet, and mellow-sweet) were used. Feature derivation was performed on 51 raw chemical indices, followed by a three-tier key indicator selection process comprising separability analysis, random forest (RF) importance ranking, and elimination of redundant features via correlation analysis. Multiple ML models were compared to select the one best suited to the Sichuan FCT dataset. Model parameters were then optimized with a genetic algorithm (GA), and the model's decision-making mechanism was interpreted visually using SHAP values. After the three-tier screening, 9 key characteristic indices were identified, including rutin-malonic acid, rutin, and chlorogenic acid. RF was the optimal model for this dataset; after parameter optimization, it achieved an F1-score of 88.3% and an accuracy of 93.5%. The approach substantially reduced detection cost and improved the model's discrimination performance. Additionally, the SHAP value interpretation framework clearly revealed the intrinsic correlation between chemical characteristics and aroma types. This study not only enhances the efficiency of aroma type classification for Sichuan FCT but also clarifies the key chemical indicators associated with aroma traits. It further provides quantitative support for optimizing FCT quality through targeted regulation of key component contents.
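The third screening tier, elimination of redundant features via correlation analysis, can be illustrated with a minimal sketch. The abstract does not give the study's correlation measure, threshold, or implementation, so everything here is an assumption: Pearson correlation, a cutoff of 0.9, the hypothetical helper names `pearson` and `drop_redundant`, and the toy columns `feat_a`, `feat_b`, and `feat_c`, which are not data from the study. Features are assumed to arrive already ranked from most to least important (for example, by RF importance), and a feature is kept only if it is not strongly correlated with any feature kept before it.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # treat a constant column as uncorrelated
    return cov / sqrt(var_x * var_y)


def drop_redundant(columns, ranked_names, threshold=0.9):
    """Greedy filter: walk features from most to least important and keep
    one only if |r| with every already-kept feature is below `threshold`.
    The threshold value is an assumption, not taken from the study."""
    kept = []
    for name in ranked_names:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy data for illustration only (not from the study).
columns = {
    "feat_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feat_b": [2.1, 4.0, 6.2, 8.1, 9.9],  # roughly 2 * feat_a, so redundant
    "feat_c": [5.0, 1.0, 4.0, 2.0, 3.0],  # weakly related to feat_a
}
ranked = ["feat_a", "feat_b", "feat_c"]  # assumed importance order
selected = drop_redundant(columns, ranked)
print(selected)
```

Processing features in importance order means that, within a group of highly correlated indices, the one ranked highest survives, which is one plausible way to shrink a large index set without discarding the most informative member of each group.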
