MAAPE: a tool for modular evolution analysis of protein embeddings
Abstract
We present MAAPE, a novel algorithm that integrates a k-nearest neighbour (KNN) similarity network with co-occurrence matrix analysis to extract evolutionary insights from protein language model (PLM) embeddings. The KNN network captures diverse evolutionary relationships and events, whereas the co-occurrence matrix identifies directional evolutionary paths and potential signals of gene transfer. MAAPE addresses a key limitation of traditional sequence alignment methods by detecting structural homology and functional associations between proteins with low sequence similarity. It analyses the embeddings with sliding windows of varying sizes to uncover both local and global evolutionary signals encoded by PLMs. We benchmarked MAAPE on three well-characterised protein family datasets: the RecA/RAD51 DNA repair protein families, the form I Rubisco families, and P450 proteins from oomycetes. In all three cases, MAAPE reconstructed evolutionary networks that agreed with established phylogenetic relationships. This approach offers a deeper understanding of evolutionary relationships and holds significant potential for applications in protein evolution research, functional prediction, and the rational design of novel proteins. The MAAPE algorithm is available on GitHub: https://github.com/Qinlab502/MAAPE.
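To make the two building blocks named above more concrete, the following is a minimal illustrative sketch, not the MAAPE implementation. It assumes each protein is represented by one fixed-length embedding vector, cuts those vectors into windows, and links each protein to its k most cosine-similar neighbours. The function names (`sliding_windows`, `knn_edges`), the toy random data, and all parameter values are hypothetical; the co-occurrence step is omitted because the abstract does not specify it.

```python
import numpy as np

def sliding_windows(embeddings, window_size, stride=1):
    """Cut each embedding vector into sub-vectors of length window_size.

    Returns an array of shape (n_proteins, n_windows, window_size).
    """
    n, d = embeddings.shape
    starts = range(0, d - window_size + 1, stride)
    return np.stack([embeddings[:, s:s + window_size] for s in starts], axis=1)

def knn_edges(vectors, k):
    """Return directed edges (i, j) from each row to its k most cosine-similar rows."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)   # guard against zero vectors
    sim = unit @ unit.T                            # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)                 # exclude self-matches
    neighbours = np.argsort(-sim, axis=1)[:, :k]   # top-k per row
    return [(i, int(j)) for i in range(len(vectors)) for j in neighbours[i]]

# Toy stand-in for PLM embeddings: 6 proteins, 16 dimensions (assumed values).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 16))

windows = sliding_windows(embeddings, window_size=4, stride=4)
edges = knn_edges(embeddings, k=2)
```

Varying `window_size` in this sketch changes how local the compared sub-vectors are, which mirrors the idea of probing local and global signals at different scales; how MAAPE itself combines windows of different sizes is described in the repository rather than here.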
