Open Access Article
Javier Espinoza-Herrera (a), María F. Manríquez-García (b), Sofía Medina-Bermejo (c), Ailyn López-Jasso (d), Juan P. Ruiz-Alcocer (e), Adriana Siordia (a), Sarah M. Veskimägi (a), Nate Roethler (a) and Adrian Jinich* (a, f)
(a) Department of Chemistry and Biochemistry, University of California San Diego, San Diego, USA
(b) Instituto Politécnico Nacional, Silao, Mexico
(c) Universidad Autónoma de Baja California, Mexicali, Mexico
(d) Instituto Politécnico Nacional, Mexico City, Mexico
(e) Instituto Tecnológico y de Estudios Superiores de Occidente, Guadalajara, Mexico
(f) Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, San Diego, USA. E-mail: ajinich@health.ucsd.edu
First published on 16th June 2026
The era of modern AI-driven representations of proteins is here, and moving fast, yet tools for their intuitive visualization and exploration lag behind. Sequence Similarity Networks (SSNs) have long filled this role for alignment-based methods, providing simple but widely adopted platforms for grouping proteins by homology. Building on this foundation, we present the Protein Language Visualizer (PLVis), a modular framework that applies existing pre-trained protein language model (pLM) embeddings, dimensionality reduction, and clustering to generate interactive maps of protein relationships. The central contribution is the PLVis repository, an online resource where thousands of reference proteomes can be compared and annotated through an accessible, interactive interface, much like SSNs became impactful not for their technical novelty but for their broad usability. We first validate that well-separated clusters in PLVis reliably capture homology information, while emphasizing caution when interpreting central “fuzzy” regions. We then illustrate the value of PLVis through case studies spanning individual protein families to full proteome comparisons across Mycobacterium and Plasmodium species. By combining methodological clarity with broad accessibility, the PLVis repository provides a low-barrier platform for exploring proteomes through the lens of language models.
Especially in the new era of AI-driven protein representations, where models are powerful but often black-box, visual tools are invaluable for exploring and interpreting large protein collections. Sequence Similarity Networks (SSNs) have long served this role and remain widely adopted.5–9 SSNs provide a simple yet effective way to display protein relationships: nodes represent proteins, and edges reflect pairwise similarity scores, which may be scaled or filtered according to the underlying metric.10 Tools such as CLANS have long supported the visualization of pairwise sequence similarities, further contributing to the widespread use of network-based representations.11 More recently, web tools like the EFI-EST have made SSNs accessible and popular for probing sequence-function relationships.12
Other statistical approaches, such as Profile Hidden Markov Models (HMMs), are also widely used to detect conserved patterns in protein sequences and classify them into families and domains.13,14 Originally developed in speech recognition and later adopted in NLP,15,16 HMMs underpin many of the curated resources biologists rely on today. However, unlike SSNs, HMM-based results are not typically designed for direct interactive visualization; instead, they are represented through sequence logos, hierarchical trees, or heat maps.17–19 Although conceptually distinct, the similarity scores produced by HMM profile searches can also be used in network-based analyses when desired. Complementing these sequence-based strategies, large-scale analyses of structure databases (e.g. AlphaFold Protein Structure Database) highlight how structure-based comparisons can also accelerate the annotation of uncharacterized proteins.20
Protein Language Models (pLMs), also adapted from NLP, are the conceptual successors of HMMs.21,22 Built on transformer architectures with attention and positional encoding,23,24 they are trained through masked-sequence prediction to learn context-dependent representations of amino acids. The resulting high-dimensional embeddings can represent both individual residues (tokens) and entire protein sequences, providing powerful features for downstream prediction and classification tasks, including structure prediction, as well as protein generation and design.23,25–27
The rise of pLMs and their rich embeddings presents an opportunity to design a new generation of interactive visualization tools for protein similarity. Here, we highlight an accessible approach: interactive exploration of two-dimensional projections of pLM embeddings. Beyond the method itself, our aim is to build a shared resource where such visualizations can be systematically applied across proteomes and protein families. By turning what could be one-off plots into a collective, searchable repository, we hope to make these representations broadly useful for researchers, educators, and even community-driven or citizen science efforts. While several studies have combined protein-language model embeddings with dimensionality reduction or embedding-space visualization to analyze individual protein families or functional subsets (e.g., via embedding trees or interactive embedding-space exploration), these efforts remain largely family-centric and do not systematically address species-wide proteome comparison.28–31 To our knowledge, no existing resource provides a taxonomically organized, interactive repository for embedding-based visualization across full reference proteomes, integrating clustering, annotation enrichment and cross-species comparative capability.
In this work, we focus on how protein language model embeddings can be explored through 2D projections. Aware of critiques of dimensionality reduction in other fields, particularly single-cell genomics,32–36 we cautiously assess how well these projections preserve sequence relationships. We find that well-separated clusters consistently retain homology information, whereas large, central “fuzzy” regions require caution. Next, through case studies, we illustrate how such projections can support exploratory analysis across scales, from individual protein families to full proteomes, highlighting examples from Mycobacterium and Plasmodium species. Finally, we introduce the PLVis repository, an interactive resource covering thousands of reference proteomes, alongside a Colab notebook that enables researchers to apply the pipeline to their own data. Together, these contributions establish the PLVis repository as a low-barrier platform for comparative proteomics in the era of protein language models. As with SSNs, the strength of PLVis lies not in methodological novelty but in accessibility and scale: by turning language model embeddings into reusable visual resources, we aim to make them as widely useful for the protein community as SSNs have been for sequence alignments.
For the analyses in this study, we drew on five datasets: 10 000 radical SAM (rSAM) enzymes, a set of sterol-binding proteins, the full proteome of M. tuberculosis, eight Mycobacterium proteomes, and five Plasmodium proteomes. To compare dimensionality reduction methods, we generated projections with UMAP, t-SNE, and TriMAP under default parameters (Fig. S1 and S2) and assessed clustering quality using the Davies-Bouldin Index (DBI) and Calinski-Harabasz Index (CHI). DBI quantifies cluster quality by comparing within-cluster dispersion to between-cluster separation, whereas CHI measures the ratio of between-cluster to within-cluster dispersion. For the smaller datasets, UMAP consistently produced more compact and well-separated clusters, while for proteome-scale datasets, its performance was intermediate. We further examined UMAP hyperparameters (e.g., min_dist, random seeds) and found that the overall clustering patterns and case study conclusions remained robust (Fig. S3). Given this balance of performance and interpretability, we selected UMAP as the default method for subsequent analyses. Lastly, because PLVis is designed as a visualization and exploration tool, clustering is applied to the two-dimensional projections rather than to the original embedding space, ensuring that cluster boundaries correspond to visually separable regions in the final map; this distinction between high-dimensional neighbors and 2D cluster structure is illustrated in Fig. S4. For consistency across datasets, the number of clusters (K) was selected by maximizing the silhouette score under default parameters, providing an automated way to identify visually coherent groups.
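As a minimal sketch of the two indices named above, the following Python computes DBI and CHI for labelled 2D points using only the standard library. The helper names and the toy coordinates are illustrative assumptions and are not taken from the PLVis pipeline; a library implementation would normally be used for real projections.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def davies_bouldin(clusters):
    """Mean over clusters of the worst (s_i + s_j) / d_ij ratio; lower is better."""
    cents = [centroid(c) for c in clusters]
    # s_i: mean distance of a cluster's points to its own centroid
    spread = [sum(math.dist(p, m) for p in c) / len(c)
              for c, m in zip(clusters, cents)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((spread[i] + spread[j]) / math.dist(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k

def calinski_harabasz(clusters):
    """Between- over within-cluster dispersion, each divided by its degrees
    of freedom; higher is better."""
    all_points = [p for c in clusters for p in c]
    n, k = len(all_points), len(clusters)
    overall = centroid(all_points)
    cents = [centroid(c) for c in clusters]
    between = sum(len(c) * math.dist(m, overall) ** 2
                  for c, m in zip(clusters, cents))
    within = sum(math.dist(p, m) ** 2
                 for c, m in zip(clusters, cents) for p in c)
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical toy projections: two compact, distant clusters versus
# two diffuse clusters that sit closer together.
tight = [[(0, 0), (0, 1), (1, 0), (1, 1)],
         [(10, 10), (10, 11), (11, 10), (11, 11)]]
loose = [[(0, 0), (0, 4), (4, 0), (4, 4)],
         [(6, 6), (6, 10), (10, 6), (10, 10)]]

print(davies_bouldin(tight), davies_bouldin(loose))
print(calinski_harabasz(tight), calinski_harabasz(loose))
```

In this toy case the compact, distant clusters give a lower DBI and a higher CHI than the diffuse ones, which is the direction in which both indices reward compactness and separation.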
Next, we compared PLVis to the standard BLAST-based approach, the Sequence Similarity Network (SSN). SSNs are widely used to explore sequence–function relationships by grouping proteins into connected clusters based on pairwise alignment scores.9,42–44 A key feature of SSNs is their reliance on a user-defined threshold: at high cutoffs, many proteins appear as isolated nodes, whereas at lower cutoffs more connections are drawn but functional specificity may be lost. To examine how this thresholding compares with PLVis, we analyzed the dataset of 10 000 randomly selected radical SAM (rSAM) enzymes and the sterol-binding protein dataset using both approaches (Fig. 2).
Fig. 2A1 and B1 show that the embeddings of both datasets form distinct clusters in two-dimensional space. We selected the five densest clusters from the SSNs (Fig. 2A2 and B2) and examined their correspondence within the PLVis projections, which were partitioned using K-means clustering, yielding 88 clusters for the rSAM dataset and 48 clusters for the sterol dataset. Overall, SSN clusters were largely conserved within the PLVis projections. For instance, SSN cluster 4 in the rSAM comparison was fully contained within PLVis clusters 7 and 57, while SSN clusters 3 and 5 in the sterol-binding comparison were each confined to a single K-means cluster (21 and 32, respectively). A salient characteristic of the SSNs is the large fraction of proteins that appear as isolated nodes at the selected similarity threshold, comprising 1932 proteins (∼20% of the rSAM dataset) and 4420 proteins (∼80% of the sterol dataset). In contrast, within the PLVis representations, these proteins were redistributed across 75 of the 88 K-means clusters in the rSAM dataset and across all clusters in the sterol dataset, thereby integrating them with related sequences rather than leaving them disconnected.
To evaluate the extent to which PLVis clusters capture biologically meaningful information, we performed enrichment analyses on all K-means clusters using a hypergeometric test for both datasets. For each cluster, the most frequent InterPro annotations (Family, Domain, and Other) were assessed relative to their background frequencies in the full dataset. Among clusters with available annotations, 96% were enriched for an InterPro “Family” term, 74% for “Domain”, and 68% for “Other” in the rSAM dataset, while corresponding values for the sterol dataset were 95%, 90%, and 100%, respectively. Importantly, 93% of the 1932 proteins that appeared as isolated nodes in the rSAM SSN were assigned to PLVis clusters enriched for an InterPro “Family” annotation. For example, PLVis cluster 45 grouped 18 previously unconnected proteins with sequences annotated as belonging to the TatD-associated rSAM family (IPR023821). Likewise, in the sterol-binding representation, PLVis cluster 16 grouped 143 SSN isolated proteins that share a conserved site (IPR020904) within the short-chain dehydrogenase/reductase family (IPR002347). These results illustrate how PLVis can recover meaningful groupings for proteins that SSNs leave disconnected and often excluded from downstream analyses.
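The enrichment test described above can be sketched as an upper-tail hypergeometric probability. The function below is an illustrative standard-library version; its name, argument names, and the example counts are assumptions made for demonstration and are not values from the study.

```python
from math import comb

def hypergeom_enrichment_p(population, annotated, cluster_size, annotated_in_cluster):
    """P(X >= annotated_in_cluster) when cluster_size proteins are drawn without
    replacement from `population` proteins, `annotated` of which carry the term."""
    upper = min(annotated, cluster_size)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, cluster_size - i)
        for i in range(annotated_in_cluster, upper + 1)
    )
    return favourable / comb(population, cluster_size)

# Hypothetical counts: 10 000 proteins, 50 carry a given InterPro term,
# and a cluster of 100 proteins contains 18 of them.
p_value = hypergeom_enrichment_p(10_000, 50, 100, 18)
print(f"enrichment p-value: {p_value:.3g}")
```

Using exact integer binomial coefficients avoids floating-point underflow in the individual terms; for proteome-scale screens a statistics library would likely be faster.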
To further assess whether PLVis clusters align with curated homology and orthology classifications, we performed systematic enrichment analyses against CATH FunFams and OrthoDB ortholog groups across all datasets (Fig. S5). For each k-means cluster, enrichment was evaluated using a hypergeometric test with Benjamini–Hochberg correction. OrthoDB annotations showed widespread enrichment across clusters in all datasets, indicating strong consistency between embedding-derived groupings and curated ortholog assignments. In contrast, enrichment for CATH FunFams was more variable, reflecting the more limited coverage of structural annotations for large and diverse enzyme families such as rSAM. Together, these results support the use of PLVis as an exploratory framework for organizing proteins in a manner consistent with established homology and orthology resources, without performing explicit homology or orthology inference; however, the resulting organization is contingent on the underlying embedding model used to represent protein sequences.
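Because one test is run per cluster, the p-values are adjusted for multiple testing. A minimal sketch of the Benjamini–Hochberg step-up adjustment is shown below; the function name and the example p-values are illustrative assumptions.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is never larger than the one ranked above it.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical per-cluster p-values
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(q_values)
```

A cluster would then be called enriched when its adjusted value falls below the chosen false discovery rate; the cutoff itself is not specified in this section.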
We applied this same metric to pLM-based projections to quantify how much clusters preserve neighborhood information. Guided by interactive exploration of annotations across clusters, we hypothesized that well-separated clusters, compact groups that are visually distinct from large, central “fuzzy” regions, more faithfully preserve local relationships in the ambient embedding space, and that this distinction could be quantified rather than assumed. To evaluate this, we analyzed five UMAP datasets, whose properties are summarized in Table S1.
To quantify cluster separation, we calculated silhouette scores for each protein at a fixed number of clusters. We tested thresholds from 0.5 to 0.95, and found that higher cutoffs consistently enriched for clusters with stronger agreement between 2D projections and embedding-space similarity. For clarity, we used 0.95 to illustrate the most stringent case (well-separated clusters are shown in blue in Fig. 3), but the same trend was observed across thresholds (Fig. S6). For each protein, we then calculated the Jaccard distance between its neighbors in the ambient embedding space and in the 2D projection. In our analysis (Fig. 3 and S7), well-separated clusters had significantly lower Jaccard distances than other clusters (p < 0.001, Mann–Whitney U). As a complementary measure, we compared cosine similarity of high-dimensional embeddings within clusters, which was also significantly higher for well-separated clusters (p < 0.001), reflecting that they contain more closely related proteins.
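The neighborhood-preservation measure can be illustrated with a short sketch: for each point, take its k nearest neighbors in the original space and in the 2D projection, then compute the Jaccard distance between the two sets. The choice of Euclidean distance, the value of k, and the toy coordinates are assumptions made here for illustration; the section does not state which were used.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbours of point i (Euclidean), excluding i."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:k])

def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B|; 0 means identical neighbour sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def neighbourhood_jaccard(high_dim, low_dim, k):
    """Per-point Jaccard distance between high-dimensional and 2D neighbour sets."""
    return [jaccard_distance(knn(high_dim, i, k), knn(low_dim, i, k))
            for i in range(len(high_dim))]

# Hypothetical toy data: six points forming two groups in 3D
high = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (10, 0, 0), (11, 0, 0), (12, 0, 0)]
faithful = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]   # keeps the layout
scrambled = [(0, 0), (10, 0), (1, 0), (11, 0), (2, 0), (12, 0)]  # mixes the groups

print(neighbourhood_jaccard(high, faithful, k=2))
print(neighbourhood_jaccard(high, scrambled, k=2))
```

A projection that keeps each point's neighbors yields distances of zero, whereas one that mixes the two groups yields larger values, which is the contrast reported between well-separated clusters and the rest.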
Among the datasets, the eight Mycobacterium proteomes showed the clearest effect, with the highest number of well-separated clusters and the strongest agreement between 2D projections and embedding-space similarity. This likely reflects the presence of many conserved protein families shared across closely related species, which are expected to form tight and well-defined groups in embedding space. Importantly, this pattern emerges most clearly in cross-species comparisons, where proteins annotated to the same conserved families cluster together and separate from lineage-specific proteins. Such comparative contexts make it particularly straightforward to visually and quantitatively explore conserved protein families across proteomes. Although these trends are expected given the properties of non-linear projections, explicitly quantifying them provides practical guidance for interpreting PLVis visualizations, clarifying when visual separation reflects meaningful embedding-space structure and when it does not.
We then sought to validate the well-established fact that inter-cluster distances in non-linear projections are not particularly meaningful by evaluating whether nearest neighboring clusters have more similar pLM embeddings than randomly selected clusters. Given that non-linear dimensionality reduction techniques like t-SNE and UMAP warp the shape of the data when projecting to lower dimensions, distances between clusters of data points should not be interpreted directly. Using the previously mentioned datasets, we calculated the average cosine similarity for the embeddings of proteins within each cluster and compared it to the inter-cluster cosine similarity with (1) the nearest neighboring cluster and (2) a randomly selected cluster (Fig. 4 and S8).
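The comparison just described can be sketched as follows: compute the mean pairwise cosine similarity within a cluster, then between that cluster and (1) its nearest neighbor in the 2D map and (2) a randomly chosen other cluster. The cluster names, toy embeddings, 2D centroids, and the use of centroid distance to define the nearest cluster are all illustrative assumptions.

```python
import math
import random
from itertools import combinations, product

def cosine(u, v):
    """Cosine similarity between two vectors given as tuples."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def mean_intra(cluster):
    """Mean cosine similarity over all pairs inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def mean_inter(c1, c2):
    """Mean cosine similarity over all cross-cluster pairs."""
    pairs = list(product(c1, c2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def nearest_cluster(name, centroids_2d):
    """Other cluster whose 2D centroid is closest to that of `name`."""
    return min((other for other in centroids_2d if other != name),
               key=lambda other: math.dist(centroids_2d[name], centroids_2d[other]))

# Hypothetical embeddings (3D for brevity) and 2D cluster centroids
embeddings = {
    "A": [(1.0, 0.1, 0.0), (0.9, 0.2, 0.0)],
    "B": [(0.8, 0.6, 0.0), (0.7, 0.7, 0.1)],
    "C": [(0.0, 0.1, 1.0), (0.1, 0.0, 0.9)],
}
centroids_2d = {"A": (0.0, 0.0), "B": (1.0, 0.5), "C": (5.0, 5.0)}

target = "A"
neighbour = nearest_cluster(target, centroids_2d)
rng = random.Random(0)  # fixed seed so the random pick is repeatable
random_other = rng.choice([c for c in embeddings if c != target])

print("intra:", mean_intra(embeddings[target]))
print("nearest:", neighbour, mean_inter(embeddings[target], embeddings[neighbour]))
print("random:", random_other, mean_inter(embeddings[target], embeddings[random_other]))
```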
In Fig. 4 and S8, the violin plots illustrate how cosine similarity varies as we move from proteins within the same cluster to those in the nearest neighboring clusters and finally to random clusters, highlighting trends for both well-separated and poorly-separated clusters. For all datasets, cosine similarity is notably highest within well-separated clusters, aligning with previous observations on local similarity, while poorly-separated clusters show a more gradual decline. We used Cohen's D, an effect size measure that quantifies the magnitude of differences between two distributions (values around 0.2 are considered small, ∼0.5 medium, and ≥ 0.8 large), to assess two comparisons: (1) intra-cluster versus neighboring-cluster similarity scores, and (2) neighboring-cluster versus random-cluster similarity scores. These comparisons were performed separately for both well-separated and poorly separated clusters. We also report Mann–Whitney U test results (Table S2), which provide a non-parametric measure of statistical significance; together, the two metrics capture both the size and reliability of the observed effects. When comparing proteins to their nearest neighboring cluster, well-separated clusters showed a sharp decrease in similarity relative to within-cluster values, whereas poorly separated clusters showed a more modest decrease (e.g., Cohen's D = 2.6 vs. D = 0.54, Fig. 4B). This result delineates a clear contrast between regions of the projection where cluster boundaries correspond to embedding-space structure and regions where boundaries are less meaningful. When comparing to a random cluster, the drop in similarity was smaller but more pronounced for poorly separated clusters (e.g., D = 2.0 vs. 0.4, Fig. 4D). This suggests that proteins in poorly separated clusters retain some similarity with nearby clusters in the same cloud. 
Thus, while absolute distances in 2D should not be overinterpreted, the spatial arrangement of clusters does preserve aspects of the underlying embedding space. While the dimensional reduction serves primarily as a visualization tool, these patterns offer additional context for interpreting both local and global relationships between protein sequences in the visualizations.
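For reference, the effect size used in these comparisons can be computed as in the sketch below. This version uses the pooled sample standard deviation, which is a common convention but an assumption here, since the text does not state which variant was applied; the example values are also hypothetical.

```python
import math
from statistics import mean, variance

def cohens_d(x, y):
    """Difference in means divided by the pooled sample standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var)

# Hypothetical similarity scores: within a cluster vs. to its nearest neighbour
within_cluster = [0.92, 0.95, 0.90, 0.94, 0.93]
nearest_neighbour = [0.80, 0.85, 0.78, 0.83, 0.82]
print(f"Cohen's d = {cohens_d(within_cluster, nearest_neighbour):.2f}")
```

Because the statistic is scaled by the spread of the data, it complements the Mann–Whitney U test: with thousands of proteins, even tiny differences can reach significance, while the effect size indicates whether they are large enough to matter.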
We first generated a pLM embedding visualization for a subset of species from the genus Mycobacterium, a group of over 190 Gram-positive bacterial species belonging to the Actinobacteria phylum. These species range from relatively harmless organisms like M. smegmatis to dangerous human pathogens like M. tuberculosis and M. leprae.45,46 These bacteria were traditionally classified by their growth rate (slow or rapid), and recent taxonomic revisions have divided them into five distinct genera: Mycolicibacterium, Mycolicibacter, Mycolicibacillus, Mycobacteroides, and Mycobacterium.47 To demonstrate the value of PLVis projections for comparing proteomes across organisms, we analyzed and visualized the dataset containing the proteomes of eight Mycobacterium species: M. smegmatis, M. fortuitum, M. kansasii, M. marinum, M. leprae, M. tuberculosis, M. bovis, and M. intracellulare (shown in Fig. 5).
A key insight from visually comparing proteomes across related organisms is the ability to quickly identify which protein families are enriched or expanded in each organism. We thus performed a hypergeometric test with the Benjamini–Hochberg false discovery rate correction to identify the clusters enriched for a single organism. Out of the 1581 k-clusters, 184 (∼12%) are enriched and are colored according to their respective organisms in Fig. 5. We found that the three clusters with the lowest FDR-corrected p-value (clusters 127, 536 & 857) all contained proteins belonging to the PE-Polymorphic GC-Rich (PE-PGRS) family. These proteins are glycine-rich with multiple GGA/GGN repeats and contain a PE domain near the N-terminus of the sequence as well as a high guanine and cytosine (GC) content of approximately 80%.48–50 Cluster 857 contains five glycine-rich “uncharacterized” proteins, one of which (A0A7G1IER6) fulfills all previously mentioned qualities (PE domain, GGX motif, and GC content) of a PE-PGRS family protein. Furthermore, none of the three clusters was categorized as well-separated, suggesting that they might be closely related to their neighboring clusters, an interpretation supported by their positions in the projection. Clusters 127 and 536 lie close together and are linked with cluster 1389, another enriched cluster with PE-PGRS proteins. Cluster 857, although situated on the other side of the projection, is also surrounded by clusters enriched for PE-PGRS family proteins belonging to M. marinum (clusters 149 & 1511). This observation is consistent with recent evolutionary analyses showing that the PE-PGRS family is not a homogeneous group but contains subfamilies and specialized members with distinct roles in mycobacterial pathogenesis and host interaction.51
To understand why the previously mentioned clusters were positioned in separate parts of the visualization, we obtained the AlphaFold structures for proteins belonging to the PE-PGRS clusters and calculated the TM-score between them. The comparison was made between proteins belonging to the same cluster, and random pairs from different clusters were selected as control, shown in the violin plots of Fig. 5. A close look at the plots for each cluster reveals that the visualization separated the protein family according to similarities in their structure, with clusters 127 and 536 having most of their scores above the 0.5 threshold. Cluster 857 shows lower scores, but a closer look at the structures in the cluster shows that they have a long disordered region near the C-terminus, which could have impacted the structural comparison. Such structural partitioning aligns with reports that individual PE-PGRS proteins have diverged to acquire specialized functions, suggesting that the separation observed here may reflect true biological heterogeneity within this protein family.51 We thus infer that the projections can separate proteins belonging to the same family according to their structure, which poses a significant advantage when looking for protein analogs to be used in experimental procedures. However, we reiterate that the distance between both groups of clusters is not a measure of their similarity.
Next, we analyzed the Plasmodium genus, consisting of protozoan parasites that require a vertebrate and an invertebrate host to complete their life cycle.52 This genus is medically significant as it contains the parasitic species that cause malaria, a vector-borne infection. Five species within this genus are known to infect humans: P. falciparum, P. malariae, P. ovale, P. vivax, and P. knowlesi.53 Similarly to the previous study, we visualized a dataset containing the proteomes of these five parasites, which is shown in Fig. 6. Compared to the Mycobacterium visualization, the Plasmodium PLVis has a larger and central poorly-separated/fuzzy region (S < 0.5). Of the 1942 k-clusters, approximately 36% were poorly-separated, compared to 14% in the Mycobacterium projection.
For this dataset, we repeated the hypergeometric test with the Benjamini–Hochberg false discovery rate correction to identify clusters enriched for a single organism, which resulted in the identification of 375 (∼19%) enriched clusters. We identified 77 enriched clusters that contained proteins exclusively from a single species, a fact that further exemplifies the greater proteomic diversity of this dataset, due to the more complex organisms shown. Because of this greater diversity, one can quickly point out regions in the projection that highlight a specific family of proteins that belong exclusively to a single species in Fig. 6. Such is the case for SICAvar proteins of P. knowlesi, Fam-L proteins of P. malariae, and RIFIN proteins belonging to P. falciparum.
It has been shown that RIFIN proteins are used by P. falciparum to evade the host immune system by binding to immune-inhibitory receptors.54 The visualization reveals that most RIFIN proteins are concentrated in three main clusters (38, 1448, and 1522), with only two RIFIN proteins found elsewhere in clusters 1582 and 1666 (A0A143ZXC7 and Q8I209). These outlier proteins are particularly interesting as they are surrounded by members of multiple protein families (REdfSA, tryptophan-rich antigen (TRAgs), and Maurer's clefts two transmembrane (PfMC-2TM) proteins), all of which, including RIFIN proteins, are associated with the infected erythrocyte's membrane.55–58 This clustering pattern suggests that A0A143ZXC7 and Q8I209 might also function as erythrocyte surface antigens or membrane proteins. As in the Mycobacterium analysis, we calculated TM-scores between these outlier proteins and the proteins in the RIFIN supercluster and in the clusters to which they belong, shown in the violin plots of Fig. 6. We found that the outlier proteins do not share structural similarity with the supercluster, which explains why they were positioned apart from the other RIFIN proteins. Yet surprisingly, they also do not share significant structural similarity with the proteins within their own clusters. Nonetheless, their embeddings show relatively high similarity to their corresponding cluster proteins (average cosine similarity ∼0.7), hinting at a purely functional relation to their neighbors. These observations and associated hypotheses showcase how PLVis can help interactively navigate large-scale protein datasets to reveal biologically significant patterns, while simultaneously providing valuable insights into protein function prediction and pathogen biology.
Towards capturing comparative full proteome visualizations across the tree of life, we collected all reference proteomes from UniProtKB. This collection of proteomes covers well-studied model organisms and proteomes of interest in biological research.2 Each proteome comparison is performed at the family level, providing a more balanced distribution that offers taxonomic resolution while including enough species for meaningful comparisons. For each of the available families in UniProtKB, the reference proteomes were retrieved to generate the corresponding pLM embeddings; in the case of outlier taxa with more than ten species, we selected the ten proteomes with the highest BUSCO completeness scores to ensure high-quality and representative comparisons. A total of 4695 reference proteomes are showcased across 3 domains, 3 kingdoms, 67 phyla, 165 classes, 404 orders, 901 families, and 2605 genera.
To facilitate navigation across the different taxonomic groups when first accessing the website, users are greeted with a collapsible tree view that helps them explore the available comparisons. Clicking on each taxon expands the tree, showing the next level of available taxa for each rank, showcasing the relation between the comparisons featured in the repository. The website also features search functionality for specific taxonomic ranks, allowing the rapid retrieval of specific proteome visualizations. If a match is found, the page will either present a list of relevant taxonomic families or redirect users to the associated proteome comparison page if the match corresponds to an available family.
Each page contains a list of the available species at the top, separated by “genus”, a PLVis projection of the proteomes, and an enrichment analysis table highlighting overrepresented annotations. Embedding representations for each protein sequence were generated using the ProtT5 language model, followed by dimensionality reduction with the UMAP and t-SNE algorithms. The resulting embeddings were then clustered using K-means to identify structural and functional patterns within each family. Finally, the K-means clusters were named by creating bi-grams of the most frequent words in the protein names present in that cluster.
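The cluster-naming step can be sketched as below. The text does not specify exactly how the bi-grams are built, so this version adopts one plausible reading: count pairs of adjacent words across the protein names in a cluster and use the most frequent pair as the label. The function name and the stop-word list are assumptions introduced for illustration.

```python
from collections import Counter

# Hypothetical list of uninformative words to skip; not taken from the source.
DEFAULT_STOPWORDS = frozenset({"protein", "putative", "uncharacterized"})

def name_cluster(protein_names, stopwords=DEFAULT_STOPWORDS):
    """Label a cluster with its most frequent pair of adjacent words."""
    bigrams = Counter()
    for name in protein_names:
        words = [w for w in name.lower().split() if w not in stopwords]
        bigrams.update(zip(words, words[1:]))
    if not bigrams:
        return "unnamed"
    (first, second), _count = bigrams.most_common(1)[0]
    return f"{first} {second}"

names = [
    "PE-PGRS family protein",
    "PE-PGRS family protein PE_PGRS33",
    "Putative PE-PGRS family protein",
]
print(name_cluster(names))
```

Clusters made up entirely of uninformative names fall back to a placeholder label in this sketch, which keeps them easy to spot in an interactive view.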
Each visualization is colored according to the organisms shown for their quick identification, as can be seen in Fig. 7. A search bar located in the upper left corner of each projection allows users to dynamically filter the data by entering keywords. Users can search by UniProt entry ID, gene name, annotation score, and other metadata fields to highlight specific proteins of interest. By examining the visualization and identifying neighboring proteins in the projection space, users can quickly locate proteins with similar embedding representations that, as demonstrated in the previous sections, preserve meaningful functional and structural information.
An enrichment table is located below the visualization to aid users in finding functionally important clusters. Hypergeometric enrichment testing was performed on each cluster by obtaining the most common organism, gene name, InterPro (IP), gene ontology (GO), EC number, and Rhea ID, and comparing their distribution within the cluster against all other clusters. This allows users to detect features that are statistically overrepresented and potentially characteristic of specific protein groups. In Fig. 7 and S9, we show the percentage of clusters that present functional enrichment across all taxonomic families in the PLVis Repository. More than 80% of UMAP and t-SNE clusters in the repository are enriched for gene, IP, and GO information, with more than 50% of the families having more than 500 significant clusters. However, the proportion of enriched clusters is lower for EC number and Rhea ID when compared to the other features. These trends in enrichment emphasize that the protein clusters available are more strongly aligned with functional domain composition than with metabolic reaction identity. Overall, enrichment analysis demonstrates that the clustering captures biologically meaningful groupings, particularly with respect to protein domains (InterPro) and functional annotations (Gene Ontology).
While the primary strength of PLVis lies in its clustering capabilities, it is important to understand both its limitations and its flexibility in practical applications. As stated before, because of the limitations of dimensionality reduction, inter-cluster distances in the visualizations should not be interpreted directly. However, this leaves users free to modify cluster coordinates in their own datasets, giving meaning to inter-cluster distance based on additional knowledge. For example, clusters can be spatially organized according to various biological parameters, such as gene expression patterns, protein essentiality profiles, or functional categories (e.g., positioning all redox enzymes in a specific region, or separating transcription factors, transporters, and enzymes). This flexibility in visualization emphasizes the importance of domain expertise and underscores the necessity for users to thoroughly understand both their biological data and the analytical tools at their disposal.
Beyond individual protein analysis and cluster organization, PLVis is most useful, from a biological perspective, in comparative analyses of complete proteomes across different species. The resulting protein clustering patterns can reveal biological insights, such as species-specific protein family absences or conserved patterns within taxonomic genera. This approach is particularly valuable for analyzing specific biological relationships, exemplified by host–pathogen interactions, where the visualization can identify clusters of proteins from both organisms that may be implicated in pathogenesis. Such protein clusters provide potential molecular signatures associated with disease mechanisms.
The PLVis repository offers researchers a quick visualization of reference proteome comparisons to find promising protein relationships within taxonomic families. Coupled with the available enrichment tables, the generated projections serve as starting points for deeper biological investigations and hypothesis generation. Expanding the collection of curated case studies through community collaboration could further develop the website as an educational and research resource. For this reason, we also make available the link to the PLVis Colab Notebook to assist users in generating their own comparisons with the pipeline used in the studies above. Together, the PLVis Repository and Colab Notebook provide a scalable platform for visualizing and analyzing large proteomic datasets, helping bridge the gap between massive collections of unannotated proteins and meaningful biological insights.
Supplementary information (SI): SI tables and figures. See DOI: https://doi.org/10.1039/d5dd00472a.
This journal is © The Royal Society of Chemistry 2026