Prediction of Enzyme Function Using an Interpretable Optimized Ensemble Learning Framework
Abstract
Accurate prediction of enzyme function, particularly for newly discovered, uncharacterized sequences, is immensely important for modern biology research. Recently, machine learning (ML)-based methods have shown promise. However, such tools often suffer from complex feature extraction, limited interpretability, and poor generalization. In this study, we construct a dataset of enzyme functions and present an interpretable ML method, SOLVE (Soft-Voting Optimized Learning for Versatile Enzymes), that addresses these issues by using only combinations of tokenized subsequences from the protein's primary sequence for classification. SOLVE uses an ensemble learning framework that integrates random forest (RF), light gradient boosting machine (LightGBM), and decision tree (DT) models with an optimized weighting strategy. This framework enhances prediction accuracy, distinguishes enzymes from non-enzymes, and predicts enzyme commission (EC) numbers for mono- and multi-functional enzymes. The focal loss penalty in SOLVE effectively mitigates class imbalance, refining functional annotation accuracy. Additionally, SOLVE provides interpretability through Shapley analyses, identifying functional motifs at the catalytic and allosteric sites of enzymes. By leveraging only primary sequence data, SOLVE streamlines high-throughput enzyme function prediction for functionally uncharacterized sequences and outperforms existing tools across all evaluation metrics on independent datasets. With its high prediction accuracy and its ability to identify functional regions, SOLVE can become a promising tool in different fields of biology and in therapeutic drug design.
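To make the ingredients named in the abstract concrete, the following minimal Python sketch illustrates three of them in isolation: tokenizing a primary sequence into overlapping subsequences, combining per-model class probabilities by weighted soft voting, and the standard focal loss form. This is not the SOLVE implementation. The function names, the subsequence length `k=3`, the example weights, and the `gamma` and `alpha` defaults are all assumptions chosen for illustration and are not taken from this work.

```python
import math
from collections import Counter


def kmer_tokens(sequence, k=3):
    """Split a protein sequence into overlapping subsequences of length k.

    The choice k=3 is an illustrative assumption.
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_counts(sequence, k=3):
    """Count how often each subsequence occurs; a simple feature representation."""
    return Counter(kmer_tokens(sequence, k))


def soft_vote(probabilities, weights):
    """Weighted average of per-model class-probability vectors.

    probabilities: one probability vector per model (e.g. RF, LightGBM, DT).
    weights: one non-negative weight per model (hypothetical values here;
             in practice they would be tuned on validation data).
    """
    total = sum(weights)
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Standard focal loss for the probability assigned to the true class.

    Confidently correct predictions (p_true near 1) are down-weighted, so
    hard or rare-class examples contribute more to the total loss.
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


if __name__ == "__main__":
    print(kmer_tokens("MKVLA"))
    # Three hypothetical models voting on two classes (non-enzyme, enzyme).
    model_probs = [[0.6, 0.4], [0.2, 0.8], [0.5, 0.5]]
    print(soft_vote(model_probs, weights=[1, 2, 1]))
    print(focal_loss(0.9), focal_loss(0.1))
```

In this sketch the predicted class is the index of the largest averaged probability. With `gamma=0` and `alpha=1` the focal loss reduces to the ordinary cross-entropy term, which is one way to see how the focusing parameter shifts emphasis toward poorly predicted examples.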