Open Access Article
Muhammad Hasan†
a,
Shihong Chen†
a,
Mengqi Jia†
a,
Cheuk Fung Alvin Leung†a,
Wentao Xu
a,
Ka Pui Chang
a,
Chi Ming Kana,
Shanqi Yapa,
Yuan Bing
b,
Kaicheng Zhu
*a,
Xiakun Chu*c and
Haibin Su
*ade
aDepartment of Chemistry, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China. E-mail: kczhu@ust.hk
bSongshan Lake Materials Laboratory, Dongguan City, Guangdong, China
cAdvanced Materials Thrust, Function Hub, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China. E-mail: xiakunchu@hkust-gz.edu.cn
dIAS Center for AI for Scientific Discoveries, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
eAI-X Open Lab., HKUST Shenzhen Hong Kong Collaborative Innovation Research Institute, China. E-mail: haibinsu@ust.hk
First published on 2nd June 2026
We present a statistical pipeline with two parallel procedures to analyze SARS-CoV-2 spike evolution: (1) probability sequence density analysis for probing its sequence space, and (2) leading mutations by the composite metric. This metric integrates mutation eigenvector information with pairwise couplings to outline evolutionarily significant mutations, coined leading mutations, from massive data sets. Our results reveal progressive increase in sequence mutation rates over time, alongside scaling behaviors predictive of variant emergence and evolutionary trends in spike mutation patterns. These findings characterize the mechanisms by which the spike glycoprotein acquires new mutations, offering insights into its evolutionary dynamics.
Studies of organismic or viral mutations often focus on their immediate functional effects. It is important to elucidate the phenotypic changes due to genotypic alterations, but this superficial level of investigation is insufficient to unravel the mechanisms underpinning adaptive evolution—this is where studies of mutation rates come into play.8,9 While maintaining high mutation rates increases the likelihood for viruses to accrue beneficial mutations,10 it may reduce their overall fitness because most mutations are either neutral or slightly deleterious.11–13 This dichotomy drives the evolution of mutation rates in both cellular and acellular life,14,15 and greatly influences the rate at which adaption occurs. Arising naturally from this delicate balance, previous studies have reported a correlation between mutation rates and population sizes across various organisms.8 Extending this notion to viral systems such as the SARS-CoV-2 spike glycoprotein can help uncover the determinants of their evolutionary trajectories.
Another collective feature determined by the emergence of multiple mutations is the statistics of their distributions. Many-body systems, in which many interacting components give rise to emergent behaviors, are omnipresent across the sciences. Viral evolution exemplifies such a system, with carriers of different mutations not acting in isolation but in interweaved manners to drive the population's evolution akin to chemical reaction networks.15–17 Time-resolved analysis of the mutation distributions in viruses allows our understanding of their evolutionary dynamics to be deepened. By incorporating phylogenetic methods, variant clusters can be identified from large-scale sequence datasets, through which the mutation distance between a given progeny and its nearest ancestor can be established.18,19 This approach enables the characterization of scaling behaviors in viral evolution—relationships that quantitatively capture the principles governing the organization and dynamics of complex biological systems.20–22
While a thorough understanding of both the mutation rate and distribution of SARS-CoV-2 paints a global picture of its evolution, it remains essential to study the details of its amino acid mutations. We previously proposed a method to score the evolutionary significance of mutable sites based on their individual mutation strengths, such that a representative set of key mutations can be identified from endless combinations.23 However, higher-order structures exist within complex networks:24 as the spike continues to acquire mutations in large numbers, accounting for their epistatic interactions has become necessary to clarify its evolution.25–27 A natural way to systematically organize such high-dimensional data is to pack it into multidimensional arrays known as tensors; decomposing these tensors into simpler orthogonal factor matrices often makes them more interpretable.28–31 With their marked usefulness in analyzing physical, chemical, and biological data, we believe tensor decomposition serves as the key to extract the leading factors of SARS-CoV-2 evolution.
The need to monitor the interactions between SARS-CoV-2 mutations in a higher order perspective is reinforced by the spike glycoprotein's intricate multi-constrainted biophysical landscape.27,32,33 The spike is a class I fusion homotrimer whose receptor-binding domain (RBD) exists in a metastable pre-fusion state that transitions dynamically between a receptor-accessible up state and a receptor-inaccessible down state, creating a fundamental tension between human angiotensin-converting enzyme 2 (hACE2) binding affinity and thermodynamic stability.2,32,34 Deep mutational scanning has shown that many mutations that enhance hACE2 affinity simultaneously destabilize the protein, forcing the virus to navigate a narrow fitness ridge.32,35 In variants such as B.1.617.2 (Delta, δ) and Omicron (o), experimental structural dynamics studies have revealed that mutations directly alter RBD motion and equilibrium distribution between conformational states,36 while molecular dynamics simulations have further shown that these effects are modulated by synergistic interactions with surface glycans.37 Consequently, variant fitness is not a simple sum of individual mutational effects but an emergent property of the spike's global physicochemical profile, where epistatic interactions stabilize otherwise deleterious but functionally advantageous mutations.27
Here, we introduce a bifurcated statistical pipeline for deciphering the temporal mutation pattern of the SARS-CoV-2 spike glycoprotein (Fig. 1). To begin with, our probability sequence density (PSD) P(d) analysis characterizes the scaling behaviors in viral evolution. In this process, log
P(d) − log
d slope of each monthly data set is computed, and its time gradient can serve as an early warning signal for variant emergence. On the other hand, our leading mutations by the composite metric (LMCM) method integrates sequence, site, and amino acid data from Tucker decomposition of a mutation tensor, and pair correlation information of mutations, to outline a set of evolutionarily important mutations, coined leading mutations, every month. Each set of LMCM-derived leading mutations is publicly accessible on our online platform at https://hbsulab.github.io/deLemus/. From these results, we present a comprehensive analysis of the evolutionary dynamics of the SARS-CoV-2 spike glycoprotein by studying temporal heterogeneities within its mutation patterns, shedding light on key mutation trends of the spike glycoprotein.
P(d) − log
d slope indicative of evolutionary regime shifts. Complementarily, to account for epistatic interactions among rapidly accumulating mutations, we constructed monthly mutation tensors from multiple sequence alignments and performed Tucker decomposition to extract dominant latent features. Using our leading mutations by the composite metric (LMCM) method, these tensor-derived mutation strengths were then integrated with pair-coupling correlations to identify leading mutations on a monthly basis.
To track the evolutionary trajectory of the SARS-CoV-2 spike glycoprotein, we compiled a longitudinal dataset spanning four years of the pandemic. By December 2023, more than 15 million amino acid sequences were retrieved from the GISAID hCoV-19 database.6 We employed the EPI_ISL_402124 sequence as the baseline reference for alignment.40 To isolate distinct evolutionary signals, minimize computational redundancy, and lessen sampling biases,41,42 we implemented a rigorous filtering protocol to remove identical sequences. This resulted in a refined dataset of 667
213 unique, distinguishable sequences for analysis. Sequences were further partitioned into temporal cohorts based on their submission month, to which multiple sequence alignment was applied using Clustal Omega,43 and phylogenetic lineages were assigned using the Nextstrain framework.19 To distinguish novel evolutionary events from baseline noise, we identified substitutions and deletions by comparing each sequence to its most recent common ancestor.
We characterized the evolutionary dynamics using two primary quantitative measures of mutation rate. The first measure is the sequence-level mutation rate Ξ in the units of āā seq−1 mo−1, calculated as the mean number of novel mutations per sequence for each variant in a given month. Novel mutations were identified through a comparative analysis with the genetically closest ancestor from preceding months. To ensure statistical rigor, we calculated the corresponding 95% confidence intervals for each major variant utilizing the Student's t-distribution to account for sample variance. The second measure is the site mutation rate in the units of āā site−1 mo−1, which evaluates selective pressure at the residue level. For each variant, this was determined by calculating the mean novel mutation count per site in the variant's final month of observation, subtracting it from the variant's first month of emergence, and normalizing it by the duration of the variant's circulation. This site-specific rate was subsequently compared against the total number of unique sequences to assess the relationship between population-level reporting and residue-level divergence.
Furthermore, we investigated the global distribution of mutations through the PSD analysis, P(d), where d represents the number of novel mutated sites across each sequence. By examining the relationship between P(d) and d in a log–log coordinate system, we identified a persistent power-law scaling relationship (Fig. S1). The slope of this scaling—derived from the log
P(d) − log
d relationship—was monitored on a monthly basis to detect shifts in the virus's evolutionary regime. Details regarding the statistical implementation and the goodness-of-fit assessments of the PSD analysis are provided in Section I of the SI.
To capture the complex interdependency between mutation sites and amino acid types, we represented the multiple sequence alignment data as a high-dimensional mutation tensor. For each month t, we constructed an m × l × a0 tensor
t = (Hijkt), where m is the number of non-degenerate sequences, l is the sequence length fixed at 1273 sites, and a0 is the number of amino acid types including deletions. An entry Hijkt = 1 signifies the presence of mutation of type k at site j in sequence i. We factorized this high-dimensional structure by Tucker decomposition, defined algebraically as
![]() | (1) |
contains the weight coefficients, or eigenvalues, for these interactions. From these components, we constructed a mutation strength matrix Mt whose entry identifies the amino acid substitution of a specific site that exerts the strongest influence on sequence divergence.
Evolutionary success is often driven by the cooperation of mutations rather than isolated ones.26 To quantify these dependencies, we constructed a pair-coupling matrix Ct = (Cijt) based on the correlation of mutation frequencies:
![]() | (2) |
denotes the joint frequency of mutation types a and b at sites i and j, and
,
denote their independent frequencies respectively. Through singular value decomposition, we isolated the primary pair-coupling matrix
from Ct, which represents the most significant signals of cooperations between mutation pairs.
Finally, we established the LMCM framework which ranks mutations based on a composite score
that integrates individual mutation strength
with its network of couplings
:
![]() | (3) |
, and the sensitivity of the Tucker decomposition to rank choice are described in Section II of the SI (Fig. S2–S4). By identifying mutations with the highest
scores, we can forecast leading mutations that are biophysically primed for selection in future variants. This dynamic ranking is updated monthly and served via our open-access platform at https://hbsulab.github.io/deLemus/ to assist the global research community in early variant detection.
We observed an increase in the sequence-level mutation rate Ξ of the spike glycoprotein over the course of the pandemic (Fig. 2). Early pre-Omicron variants demonstrated relatively low and stable mutation rates, with mean values for D614G (1.18 ± 0.05 āā seq−1 mo−1) and Alpha (1.18 ± 0.09 āā seq−1 mo−1) highlighting the limited sequence divergence at the onset of viral transmission.45 Subsequent variants of the Omicron family exhibited elevated Ξ values, with mean rates ranging from 1.41 ± 0.26 āā seq−1 mo−1 for BA.4&5 to 1.55 ± 0.47 āā seq−1 mo−1 for BA.1, indicative of accelerated evolutionary dynamics and increased mutational variance.
![]() | ||
| Fig. 2 Mutation rates Ξ of SARS-CoV-2 spike glycoproteins across major variants, expressed in units of āā seq−1 mo−1. Variants are ordered top to bottom according to their designation date by the World Health Organization.44 Inset: Correlation between site mutation rates and effective population size for variants of concern (R2 = 0.853, n = 6), expressed in āā site−1 mo−1. Other abbreviations: α – B.1.1.7 (Alpha), β – B.1.351 (Beta), γ – P.1 (Gamma), δ – B.1.617.2 (Delta), o – Omicron. | ||
Furthermore, we observed a robust positive correlation between the site-specific mutation rates and the genetic diversity of major SARS-CoV-2 variants. These include B.1.1.7 (Alpha, α), B.1.351 (Beta, β), P.1 (Gamma, γ), B.1.617.2 (Delta, δ), and Omicron (o) (Fig. 2 Inset). Direct frequency-based analyses are often susceptible to sampling biases, and certain sequences can be either heavily underrepresented or overrepresented due to discrepancies in sequencing intensities.41,42 We therefore quantified genetic diversity using the count of unique sequences, which mitigates oversampling by tallying each distinct sequence only once.
Notably, highly transmissible variants like Delta and Omicron exhibited higher site mutation rates of 0.29 āā site−1 mo−1 and 0.47 āā site−1 mo−1 respectively, as well as greater within-variant genetic diversity.46,47 Positive coupling between these traits is expected but mechanistically and causally nontrivial. One interpretation is that a high substitution rate helps maintain genetic diversity within a viral population, which promotes adaptation to optimize its fitness characteristics such as transmissibility.48,49 On the other hand, more transmissible variants could lead to greater generational turnover and thus greater substitution rate and genetic diversity via rapid parallel infection of the host population.50 In contrast, the geographically constrained Beta and Gamma variants displayed comparatively lower site mutation rates and genetic diversities.51
From the PSDs of the monthly non-degenerate sequence sets, we observed a linear scaling relationship between the quantities P(d) and d, which manifested itself across different time frames of the SARS-CoV-2 pandemic (Fig. 3a). This scaling was highly robust, and was confirmed by the systematic examination of the log
P(d) − d plots for every month from January 2020 to December 2023 (Fig. S5). March 2020, October 2020, and May 2021 were selected as representative snapshots because they illustrate the temporal evolution of the scaling slope during the first major saltation event: transitioning from an early random-mutation regime through the emergence of Alpha and Beta variants and culminating in the rapid global dominance of Delta shortly thereafter.54
![]() | ||
Fig. 3 (a) Scaling relationship between the probability sequence density (PSD) P(d) and number of mutations d in selected months (RMar202 = 0.903, n = 10; ROct202 = 0.899, n = 9; RMay212 = 0.992, n = 11). (b) Evolution of the time gradient Δ(log P(d)/log d)/Δt. Months wherein notable SARS-CoV-2 strains were designated by the World Health Organization as variants of concern, interest, or under monitoring are annotated with red, green, and orange arrows respectively.44 Other abbreviations: α – B.1.1.7 (Alpha), β – B.1.351 (Beta), δ – B.1.617.2 (Delta). | ||
The effects of this scaling are twofold. First, the consistently negative slope of the log
P(d) − d plot indicates that spike sequences with lower mutation loads prevailed over those with more mutations in each monthly ensemble. Specifically, most SARS-CoV-2 spike glycoproteins were seen to harbor around one to three mutations per month (Fig. 3a), corroborating previous estimates;55 spike sequences surpassing this range are exceedingly rare. Second, such a scaling relationship can be described as a power law, from which network-like interactions between individual SARS-CoV-2 variants can be inferred.20,52 Being a type of heavy-tailed distribution, a power law distribution creates a heightened likelihood of large-valued events than a normal distribution.22,53,56 This suggests that the evolution of the SARS-CoV-2 spike glycoprotein is a process highly susceptible to abrupt and massive mutational events, likely coming from immunocompromised patients with prolonged infections.57,58 The three major evolutionary leaps sustained by the virus, from Alpha to Delta, from Delta to BA.1, and from XBB to BA.2.86, serve as prime examples that demonstrate how a few extreme events could disproportionately impact the global SARS-CoV-2 population.54,59,60 Studies have suggested that such a burst of mutations equips saltation variants with enhanced immune evasion capabilities,60,61 or helps them navigate through entrapped regions of the fitness landscape epistatically,54,58,62 which may have led to their global predominance.
We further examined the dynamics of the scaling relationships between P(d) and d, from which we discovered an intriguing link between the time gradient of the log
P(d) − log
d slope, Δ(log
P(d)/log
d)/Δt, and SARS-CoV-2 spike evolution. In general, log
P(d)/log
d can be viewed as a quantity describing the shape of the PSD, so its time derivative enables us to track time-dependent changes in the distribution of spike sequences based on their d values. For example, a reduction in this ratio corresponds to a shift toward a sequence ensemble dominated by low-d sequences, whereas an increment corresponds to the opposite. Under this interpretation, large positive peaks in a Δ(log
P(d)/log
d)/Δt-time plot would represent a sudden influx of heavily mutated spike sequences within the virus population, which would signify the emergence of a new variant. In alignment with our hypothesis, while multiple peaks and troughs can be observed from Fig. 3b, the emergence of most major SARS-CoV-2 variants is accompanied by a positive peak in the time gradient. Therefore, this finding suggests the importance of PSD analysis in understanding the evolution of viruses, showing it can be used for the early detection of variant displacement events.
![]() | ||
| Fig. 4 Top-ranked LMCM-outlined leading mutations (LMs) in the receptor-binding domain (RBD) of the spike. Variants designated as variants of concern or interest in the same month by the World Health Organization are grouped together.44 Each pie chart represents the distribution of top-ranked LMs for a given month, where each colored sector corresponds to a specific mutation and its relative frequency within that month's top-ranked set. Vertical lines connect LMs with reported mutations if they coincide. The color scheme corresponds to the side-chain property of the residue being mutated to: Hydrophobic (orange; A, I, L, V, G, P, C, M), polar (magenta; H, N, Q, S, T), positive (red; K, R), negative (blue; D, E), aromatic (lime; F, W, Y), and deletion (gray; Δ). The complete set of RBD LMs can be found in Fig. S6 and S7. | ||
Several key RBD mutations in SARS-CoV-2 have emerged across different variants, significantly impacting hACE2 binding affinity; notably, the LMCM method consistently identified these dangerous substitutions months before they achieved global dominance. The N501Y mutation, first identified as a leading mutation in June 2020 using the LMCM method—well before the surge of Alpha—is present in the major variants Alpha, Beta, Gamma, and Omicron. While N501Y has been shown to enhance RBD stability and hACE2 binding from Alpha through Gamma via mechanisms like π–π stacking with the receptor's Y41 residue,63–65 its advantageous effects are context-dependent and less pronounced in Omicron,66 underscoring the importance of epistatic interactions between N501Y and its surrounding residues.36 Similarly, the E484K mutation, captured by our method as early as January 2020 and notably found in the Beta variant, promotes stronger hACE2 binding via local structural rearrangements when combined with N501Y and D614G.67 The T478K mutation, highlighted by LMCM in November 2020 as a future mutation of concern and later found characteristic of the Delta variant, facilitates tighter receptor binding by forming additional hydrogen bonds at the RBD-hACE2 interface.68 Another mutation associated with Delta, L452R, was identified by our pipeline in April 2020, and contributes to increased hACE2 affinity through secondary structure rearrangements with the aforementioned E484K and N501Y.67,69 Finally, the N440K mutation, outlined by LMCM as an evolutionarily significant site, enhances electrostatic complementarity with hACE2, aiding its high transmissibility.70
The time-resolved spectra of the LMCM-derived leading mutations bridge the gap between the global evolutionary trends in sequence ensembles and local functional alterations in the spike protein, allowing for the early detection of fitness-enhancing amino acid substitutions. One notable trend we observed was the accumulation of basic or polar amino acids in the RBD of the spike (Fig. S6). This convergence in physicochemical properties of substitutions was particularly pronounced within the receptor-binding motif (RBM) of the RBD (Fig. S7 and S8), an essential stretch of amino acids that engages the hACE2 receptor to mediate viral entry into host cells.71 While the enrichment of positively charged or polar residues in the RBM tends to improve spike-receptor interactions due to the predominantly negatively charged surface of the hACE2 receptor,70,72 the convergence ratio—defined here as the monthly fraction of newly appearing leading mutations among all RBM-based leading mutations—revealed a plateau in this trend over time (Fig. S8 and S9). This saturation, identified through our leading mutation analysis, may suggest a potential evolutionary trade-off between the charge optimization of the RBM for better infectivity and the immune evasion capability of the virus.73 In fact, recent studies have reported that newer Omicron subvariants exhibit diminishing gains in hACE2 binding affinity when compared to their predecessors,74,75 a shift in evolutionary strategy that can be anticipated by monitoring the divergence in LMCM spectral outputs. By identifying these evolutionarily significant sites months before they manifest in dominant global variants, the LMCM framework demonstrates robust capability to predict the physicochemical trajectory of SARS-CoV-2 spike evolution.
P(d) − log
d slope, capable of quantifying these abrupt shifts, as an early indicator for the emergence of evolutionarily important variants. Moreover, at the single amino acid level, our LMCM method bridges the global sequence ensemble dynamics with site-specific functional impacts of individual mutations by outlining crucial leading mutations within massive sequence data sets; its results are posted in our publicly accessible platform at https://hbsulab.github.io/deLemus/. Overall, the findings advance our understanding of viral evolution and demonstrates the need for extensive disease surveillance to better understand the evolutionary dynamics of circulating viruses.
Supplementary information (SI) include the tensor decomposition method and pair-coupling matrices; evolutionary trend of leading mutations outlined by the composite metric; physiochemical properties of leading mutations in reported variants, etc. See DOI: https://doi.org/10.1039/d5cp04811g.
Footnote |
| † These authors contributed equally to this work. |
| This journal is © the Owner Societies 2026 |