Hierarchical clustering and optimal interval combination (HCIC): a knowledge-guided strategy for consistent and interpretable spectral variable interval selection
Abstract
Variable selection is crucial for the accuracy of spectral analysis and is typically formulated as an optimization problem using regression techniques. However, such purely data-driven methods may overlook physical laws or mechanisms and consequently discard physically relevant variables. To address this, we propose a knowledge-guided hierarchical clustering and optimal interval combination (HCIC) strategy, in which physical principles and mechanisms inform the algorithm design so that more physically relevant feature structures are captured. In the first step, spectral variable hierarchical clustering (SVHC) uses the correlations between adjacent variables to generate non-uniform intervals. Each interval corresponds to a distinct pattern that reflects underlying molecular interactions, such as peak shifts, functional group contributions, or even non-reaction background signals. In the second step, a Bayesian linear regression-based optimal interval combination (BLR-OIC) strategy identifies the most effective interval combinations, capturing and exploiting the synergistic effects among functional bands or functional groups. We conduct extensive experiments on publicly available and proprietary datasets to validate the efficacy of the proposed algorithm. The results demonstrate not only improved predictive performance compared with benchmark methods but also greater interpretability and more consistent variable selection.
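The two-step idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the merge criterion (correlation between the mean signals of neighbouring intervals), the fixed precisions `alpha` and `beta`, the exhaustive search over small combinations, and the synthetic data are all assumptions made for illustration only.

```python
import itertools
import numpy as np


def adjacent_intervals(X, n_intervals):
    """Adjacency-constrained agglomerative clustering of spectral variables.

    Starts with one interval per variable and repeatedly merges the pair of
    neighbouring intervals whose mean signals are most correlated, until
    n_intervals remain. Returns (start, stop) index pairs, stop exclusive.
    """
    bounds = [(j, j + 1) for j in range(X.shape[1])]
    while len(bounds) > n_intervals:
        means = [X[:, a:b].mean(axis=1) for a, b in bounds]
        corr = np.array([np.corrcoef(means[i], means[i + 1])[0, 1]
                         for i in range(len(bounds) - 1)])
        corr = np.nan_to_num(corr, nan=-1.0)  # constant signals: never preferred
        i = int(np.argmax(corr))
        bounds[i:i + 2] = [(bounds[i][0], bounds[i + 1][1])]
    return bounds


def log_evidence(Phi, y, alpha=1.0, beta=25.0):
    """Log marginal likelihood of Bayesian linear regression.

    Assumes a zero-mean Gaussian prior with precision alpha on the weights
    and Gaussian noise with precision beta (both fixed here, an assumption).
    """
    n, m = Phi.shape
    A = alpha * np.eye(m) + beta * Phi.T @ Phi      # posterior precision
    mean = beta * np.linalg.solve(A, Phi.T @ y)     # posterior mean
    resid = y - Phi @ mean
    energy = 0.5 * beta * resid @ resid + 0.5 * alpha * mean @ mean
    _, logdet = np.linalg.slogdet(A)
    return 0.5 * (m * np.log(alpha) + n * np.log(beta)
                  - logdet - n * np.log(2 * np.pi)) - energy


def best_combination(X, y, bounds, max_size=2):
    """Score every combination of up to max_size intervals by log evidence."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    best_score, best_combo = -np.inf, ()
    for k in range(1, max_size + 1):
        for combo in itertools.combinations(range(len(bounds)), k):
            cols = np.concatenate([np.arange(*bounds[i]) for i in combo])
            score = log_evidence(Xc[:, cols], yc)
            if score > best_score:
                best_score, best_combo = score, combo
    return best_score, best_combo


# Synthetic "spectra": three bands of four highly correlated variables each;
# the response depends on the first and third band only.
rng = np.random.default_rng(0)
latent = rng.normal(size=(60, 3))
X = np.repeat(latent, 4, axis=1) + 0.05 * rng.normal(size=(60, 12))
y = latent[:, 0] - latent[:, 2] + 0.05 * rng.normal(size=60)

bounds = adjacent_intervals(X, n_intervals=3)
score, combo = best_combination(X, y, bounds, max_size=2)
print("intervals:", bounds)
print("selected interval indices:", combo)
```

Restricting merges to neighbouring intervals keeps every cluster contiguous along the wavelength axis, which is what makes the result an interval rather than an arbitrary variable subset. Scoring combinations jointly, rather than ranking intervals one at a time, is what allows complementary bands to be selected together.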