TransG4: an interpretable deep-learning approach for sequence-based G-quadruplex prediction
Abstract
G-quadruplexes (G4s) are non-canonical nucleic acid secondary structures that form in guanine-rich regions. They regulate diverse cellular processes such as gene expression, DNA replication, and telomere maintenance, and increasing evidence links them to cancer and other human diseases. Predicting G4 formation from nucleotide sequence alone, however, remains challenging. Existing computational tools for G4 prediction are either rule-based, encoding domain knowledge, or rely on a single neural network model such as a convolutional neural network (CNN), which lacks interpretability and struggles to capture long-range dependencies among bases. Here, we introduce TransG4, a neural network architecture that integrates a CNN, a transformer, and bidirectional gated recurrent units (BiGRUs) to identify potential G4 structures. TransG4 demonstrates strong predictive performance on both G4-seq and rG4-seq datasets, accurately predicting DNA mismatch scores and consistently outperforming existing methods in RNA RSR-ratio prediction. Attention-based interpretations further show that TransG4 captures biologically meaningful motifs consistent with canonical G4 structures, providing an interpretable and generalizable framework for sequence-based G4 propensity prediction.
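
For context on the rule-based tools mentioned above, the following is a minimal sketch of a motif scanner. It is not part of TransG4. It assumes the widely used canonical pattern of four runs of at least three guanines separated by loops of 1 to 7 nucleotides; the names `G4_MOTIF` and `find_putative_g4` are hypothetical and chosen only for illustration.

```python
import re

# Assumed canonical G4 motif (not taken from the TransG4 paper):
# four G-runs of length >= 3, separated by three loops of 1-7 nucleotides.
# Loops are matched lazily so each loop ends at the nearest following G-run.
G4_MOTIF = re.compile(r"G{3,}(?:[ACGTUN]{1,7}?G{3,}){3}", re.IGNORECASE)


def find_putative_g4(seq):
    """Return (start, end, matched_sequence) for each non-overlapping motif hit.

    Coordinates are 0-based and half-open, as in Python slicing.
    """
    return [(m.start(), m.end(), m.group(0)) for m in G4_MOTIF.finditer(seq)]


if __name__ == "__main__":
    # Four human telomeric repeats (TTAGGG) contain one canonical motif.
    for start, end, hit in find_putative_g4("TTAGGGTTAGGGTTAGGGTTAGGG"):
        print(start, end, hit)
```

A scanner of this kind returns only a yes/no match per region. It cannot rank sequences by propensity or account for context outside the motif, which is the limitation that learned models such as the one described here aim to address.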
