Open Access Article
Cameron J. Douglas (a,b) and Ciaran P. Seath* (a)
(a) Department of Chemistry, Wertheim UF Scripps, Jupiter, Florida 33418, USA. E-mail: cseath@ufl.edu; cseath@scripps.edu
(b) The Skaggs Graduate School of Chemical and Biological Sciences, 120 Scripps Way, Jupiter, FL 33458, USA
First published on 10th November 2025
Omics analysis has become an indispensable tool for researchers in the life sciences, enabling the study of DNA, RNA, and proteins and how they respond to cellular stimuli. Many methods of data analysis exist for the generation and characterization of gene lists; however, selection of genes for further investigation is still heavily influenced by prior knowledge, with practitioners often studying well-characterized genes, reinforcing bias in the literature. Here, we have developed an open-source R package for impartial ranking of gene lists derived from omics analysis that we term deciphering scientific discoveries (DeSciDe). We applied a pipeline that sorts a gene list first by precedence, which we define as co-occurrence of the gene with pre-defined search terms in publications. We then rank gene lists by connectivity, an underutilized metric for how related a gene is to other enriched genes. The combination of these rankings by scatterplot provides a method for gene selection by simple visual analysis. We apply this analysis method to published omics datasets, identifying novel avenues for investigation. Further, using this method we have been able to assign a novel loss-of-function role for the histone mutation H2A E92K.
When studying gene lists from omics analyses, it can be valuable to search for interactions between enriched genes to aid users in assigning molecular mechanisms. These physical interactions are typically explored using STRING analysis, where genes are plotted as interconnected nodes. This tool is useful when studying small lists of genes, but the graphical output becomes unwieldy and challenging to deconvolute when large numbers of interconnected genes are present, limiting the use of the tool when analysing lists with more than 50 hits.
Based on these limitations, end users of omics methods have an urgent need for tools to aid in unbiased selection of gene hits for follow-up studies. In considering this need we identified several requirements: (1) a method must incorporate the cellular stimulus or pathway that is being studied; (2) the method should be able to assign value to interactions between genes; (3) the program must be readily available and applicable to many different experimental types.
To address these requirements, we have developed an R package to enable the analysis of gene lists that are experimentally associated with a relevant search term. Our package, Deciphering Scientific Discoveries (DeSciDe), can incorporate any number of cellular stimuli and cross-reference them against lists of enriched genes. Based on the co-occurrence of the gene and the cellular stimulus, a hit is assigned as either strongly or poorly associated with a particular search term. Further, our system computes network connectivity metrics from the gene interaction networks provided by the STRING database, producing a quantitative evaluation of connectivity. The gene lists can then be sorted based on two key values: interconnectivity and precedence. Through analysis of existing datasets, we demonstrate that connectivity is a viable metric for selecting genes for follow-up investigation and, when combined with the quantification of literature precedence, can provide a systematic and objective method for gene selection. We anticipate that such a tool will be valuable for myriad applications across the biological sciences.
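The precedence idea can be illustrated with a minimal Python sketch. It is not the DeSciDe implementation (which is an R package): the function names, the in-memory list of abstracts, and the choice to rank by the summed co-occurrence count across all search terms are assumptions made purely for illustration.

```python
import re

def precedence_counts(genes, terms, abstracts):
    """Return {gene: {term: number of abstracts mentioning both}}.

    Assumption: 'abstracts' is a hypothetical in-memory corpus of strings;
    a real pipeline would query a literature database instead.
    """
    lowered = [a.lower() for a in abstracts]
    table = {}
    for gene in genes:
        # Whole-word match so that e.g. "CD4" does not match "CD40".
        gene_re = re.compile(r"\b" + re.escape(gene.lower()) + r"\b")
        with_gene = [a for a in lowered if gene_re.search(a)]
        table[gene] = {t: sum(t.lower() in a for a in with_gene) for t in terms}
    return table

def rank_by_precedence(table):
    """Order genes by total co-occurrence count (assumed scoring), ties by name."""
    return sorted(table, key=lambda g: (-sum(table[g].values()), g))

# Toy corpus, invented for illustration only.
abstracts = [
    "CD40 signalling in cancer immunology.",
    "ICAM1 and CD40 in checkpoint blockade for cancer.",
    "LY75 expression in dendritic cells.",
]
terms = ["cancer", "immunology", "checkpoint blockade"]
table = precedence_counts(["CD40", "ICAM1", "LY75"], terms, abstracts)
print(rank_by_precedence(table))
```

In this toy corpus, a gene that never co-occurs with any search term (here LY75) lands at the bottom of the precedence ranking, which is the behaviour the text describes for poorly associated hits.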
Next, we sought to establish a metric for quantifying and ranking gene interconnectivity, which we suggest may be valuable and generally applicable for hit selection following omics analyses. We incorporated existing STRING interaction networks and quantified each gene according to two criteria: number of interactions (known as degree in graph theory), and connectivity (known as clustering coefficient in graph theory) (Fig. 1C). The number of interactions is straightforward, representing the number of connections made by each node. A gene's connectivity score represents the percentage of existing connections compared to the theoretical maximum number of connections in the subnetwork spanned by the respective node (the network composed of the node and its neighbours). We then compile the gene lists and their network properties in a filterable table. By default, the genes are ranked first by the number of connections and then by percent connectivity. The package also produces a scatterplot showing the number of interactions versus the percent connectivity. This type of plot provides a visual representation of how interconnected the genes in the list are. Genes in the bottom left of the plot have few connections and low connectivity, whereas the top right corner contains the most connected set of genes (Fig. 1C).
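The two network metrics can be sketched in a few lines of Python. This is a minimal illustration rather than the package's own code: it assumes the interaction network is supplied as a plain list of gene pairs, and it uses the standard local clustering coefficient (links among a node's neighbours divided by the maximum possible such links) as the connectivity percentage. The function names are hypothetical.

```python
from itertools import combinations

def connectivity_metrics(genes, edges):
    """Return {gene: (degree, percent_connectivity)} for an undirected network."""
    neighbours = {g: set() for g in genes}
    for a, b in edges:
        # Ignore self-loops and edges to genes outside the list.
        if a != b and a in neighbours and b in neighbours:
            neighbours[a].add(b)
            neighbours[b].add(a)
    metrics = {}
    for gene, nbrs in neighbours.items():
        k = len(nbrs)
        if k < 2:
            # Clustering coefficient is taken as 0 when fewer than two neighbours exist.
            metrics[gene] = (k, 0.0)
            continue
        links = sum(1 for u, v in combinations(sorted(nbrs), 2) if v in neighbours[u])
        possible = k * (k - 1) // 2
        metrics[gene] = (k, 100.0 * links / possible)
    return metrics

def rank_by_connectivity(metrics):
    """Rank by degree, then percent connectivity, then name (all best first)."""
    return sorted(metrics, key=lambda g: (-metrics[g][0], -metrics[g][1], g))

# Toy network, invented for illustration only.
genes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
metrics = connectivity_metrics(genes, edges)
print(rank_by_connectivity(metrics))
```

In the toy network, gene A has three neighbours of which only one pair (B and C) is itself connected, giving a degree of 3 and a connectivity of one third, whereas the isolated gene E scores zero on both axes and would sit in the bottom left of the scatterplot.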
The final component of the application is the combination of precedence and connectivity rankings. Here, we create a scatterplot that displays the rank order of precedence versus the rank order of connectivity. Since rank order arranges values from high to low, in this visualization, hits that have high precedence and high connectivity appear close to the origin. By default, DeSciDe classifies these genes using a threshold of 20% of the total genes in the list: high-connectivity, high-precedence genes fall in the top 20% of ranked genes in both lists, and high-connectivity, low-precedence genes fall in the top 20% for connectivity and the bottom 20% for precedence. This threshold can be adjusted by the user as deemed fit for their analysis. We found this to be a broadly useful visualization for hit selection. To illustrate how DeSciDe might be used we applied our workflow to four case studies.
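The threshold-based classification described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the package's code: the function name, the label strings, and the rounding of the cutoff to a whole number of genes are hypothetical choices, and both inputs are assumed to be lists of the same genes ordered from best to worst.

```python
def classify_hits(precedence_order, connectivity_order, threshold=0.20):
    """Label each gene from two rankings (lists ordered best first).

    Assumption: the cutoff is threshold * n rounded to a whole number of
    genes, with a minimum of one gene per category.
    """
    n = len(precedence_order)
    cutoff = max(1, round(threshold * n))
    p_rank = {g: i + 1 for i, g in enumerate(precedence_order)}
    c_rank = {g: i + 1 for i, g in enumerate(connectivity_order)}
    labels = {}
    for gene in precedence_order:
        high_conn = c_rank[gene] <= cutoff
        if high_conn and p_rank[gene] <= cutoff:
            labels[gene] = "high connectivity, high precedence"
        elif high_conn and p_rank[gene] > n - cutoff:
            labels[gene] = "high connectivity, low precedence"
        else:
            labels[gene] = "unclassified"
    return labels

# Toy rankings for ten genes, invented for illustration only.
precedence_order = ["A", "B", "E", "F", "G", "H", "I", "C", "J", "D"]
connectivity_order = ["A", "D", "E", "F", "G", "H", "I", "B", "C", "J"]
labels = classify_hits(precedence_order, connectivity_order)
for gene, label in labels.items():
    print(gene, label)
```

With ten genes and the default 20% threshold the cutoff is two genes, so gene A (first in both rankings) falls near the origin, while gene D (second for connectivity, last for precedence) is flagged as a highly connected but poorly precedented candidate.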
The plots produced by the DeSciDe analysis pipeline can be easily exported and saved for use in presentations and publications. Examples of the plots produced by running DeSciDe with default settings can be seen in the SI (Fig. S1–S5). Additionally, the data tables of results can be saved as TSV, CSV, or Excel files for further analysis or for generating revised figures.
We began by analysing proximity proteomics data sets. We chose these as examples as they are uniquely suited to this analysis for several reasons: (1) the bait (or protein of interest bearing the proximity labelling enzyme or catalyst) is not typically included in standard analysis as it is often spatial9 or biological (i.e., a small molecule11), and (2) these experiments are typically designed to look for unknown interactions. Therefore, careful filtering of gene lists is required before choosing genes of interest for further investigation, adding an element of human bias.
First, we analysed a proximity proteomics dataset published by Geri et al. that describes the interactome of the receptor PDL1 on Jurkat cells.9 PDL1 plays an important role in the immune system as a “save me” signal, stopping immune cells from attacking healthy cells. As PDL1 is frequently expressed by cancer cells to evade the immune system, it has become an active oncology target.12 The purpose of this experiment was to identify novel interactors of PDL1 that may play a role in immune oncology. From the gene list provided, we filtered for all differentially enriched genes that met statistical significance (41 total) and passed them through the DeSciDe pipeline (Fig. 2A) with the keywords cancer, immunology, and checkpoint blockade. STRING analysis of the 41 hits was informative, with 12 genes in the list having no known interactions and 24 genes having at least three known interactors within the data set (Fig. 2B). To illustrate how connectivity is calculated from STRING data, nodes surrounding FCER2 and HLA-B are shown in Fig. 2C. DeSciDe ranks these genes by connectivity (Fig. 2D), placing ICAM1, CD40, and CD274 (PDL1) as the top three hits. Of these three, CD274 is the bait protein, and CD40 and ICAM1 have both been validated to colocalize with PDL1 in immune synapses and are themselves targets for immune-based therapies.13–15
Fig. 2 Pipeline for analysis of a proximity proteomics dataset. (A) Graphical representation of DeSciDe pipeline for unbiased ranking of gene lists. (B) STRING analysis of PDL1 interactomics data set, published by Geri et al.9 (C) Example of how connectivity analysis is performed. (D) PDL1 interactome ranked by connectivity (top 6 genes shown). (E) Heatmap showing results of DeSciDe data mining against the search terms “cancer” “immunology” and “checkpoint blockade”. (F) Scatterplot of genes ranked for connectivity vs. precedence with suggested alternate genes for investigation highlighted in boxes. Graphs made in Prism from exported DeSciDe data.
Next, using DeSciDe datamining, we cross-referenced the gene list with the search terms described above and plotted the results as a heatmap (Fig. 2E). These data clearly show that the genes within this data set have significant overlap with the search terms immunology and cancer, but less with the more specific term checkpoint blockade. Our application makes it trivial to identify related genes from the heatmap. The ranked lists were then combined and displayed as a scatterplot of precedence vs. connectivity (Fig. 2F). The genes of highest connectivity and highest precedence are located near the origin. In this data set, that includes the previously discussed ICAM1, CD40, and PDL1, in addition to CD70, HLA-A, and the death receptor FAS. Based on previous reports in the area and the precedence from the data mining, all the genes within this sector are high-confidence hits.16–19 With this knowledge, it can be assumed that connectivity is a reasonable metric to rank genes. If this is correct, then moving to the top left quadrant, where connectivity remains high but the genes have far fewer precedented reports relating to the three search terms, may provide novel targets for investigation. In this area, we found 7 genes (including TNFRSF8, FCER2, LY75, CD300A, SCARB1, and LILRB1) that are candidates to be novel interactors, with less known about their involvement in PDL1-based checkpoint blockade.
In our next case study, we examined a proximity proteomics dataset with a significantly more complex interactome. The experiment, published recently by the laboratories of MacMillan and Muir, described how a somatic mutation on histone H2A disrupts the nucleosome microenvironment in HEK293T cells (Fig. 3A).10 This type of hypothesis-generating experiment is a good match for our data analysis pipeline as the proteomics data can lead to several areas of study and is easily influenced by inherent bias. In the original study, the authors identified several enriched genes (SIRT6, DNMT3A/B, BRD2/3/4) for further investigation. When analysing this data set with our pipeline, we found that these genes were among the most well studied (by searching the terms chromatin, nucleosome, and acidic patch), ranking 2nd, 3rd, 4th, 11th, and 12th of all genes in our measure of precedence (Fig. 3B). Visual inspection of STRING analysis for this data set was not instructive due to the density of the interaction networks (Fig. 3C). However, connectivity analysis suggested a set of genes that were not implicated in the original study (Fig. 3D and E).
Plotting the ranked lists against each other proved to be enlightening: both critical quadrants, i.e., the one nearest the origin (high-confidence genes) and the top left quadrant (highly connected but less well studied in this context), suggested investigation of genes relating to regulation of the cell cycle (Fig. 4A and B).
Recently, McGinty and co-workers demonstrated the role of the acidic patch in coordinating VRK1 phosphorylation of H3T3 during cell division.20 Furthermore, pathogenic mutations on VRK1 were shown to disrupt this interaction, providing a molecular basis for how these rare mutants may cause rare adult-onset distal spinal muscular atrophy. Our reanalysed data also suggest that the acidic patch may play a role in cellular division, and that the E92K mutation may lead to deregulation of this critical cellular pathway.
We further investigated this by performing cell-cycle analysis using propidium iodide in HEK293T cells stably expressing H2A or H2A E92K. We observed that the mutant cell line contains a higher proportion of cells in the G2 phase, suggesting that the small proportion of mutated nucleosomes confers a subtle change in the cell cycle (Fig. 4C). Based on this stalling, we anticipated that the mutant cell line would show reduced proliferation compared to the wild-type cell line. Cell proliferation assays of both cell lines over 72 hours showed that the E92K mutation significantly decreased proliferation, in line with our hypothesis (Fig. 4D). Based on these data, we can assign a new role for the E92K acidic patch mutation. The previously reported chemoproteomics data show that the mutation disrupts interactions between the nucleosome acidic patch and the cell cycle proteins AURKB, CDC6, and CyclinB1, which all play a significant role in the G2-M transition during cell division.23–25 This disruption leads to stalling in the G2 phase of cell division and subsequent reduction in proliferation (Fig. 4E). The data suggest that these effects are subtle, likely because incorporation is relatively low (approx. 1 in 16 nucleosomes) and because only one face of each mutant nucleosome is likely to bear the mutation. However, this effect is in line with the chemoproteomics and ATAC-seq data previously reported.10 This example indicates that unbiased analysis using a program such as DeSciDe can provide different avenues of investigation that may be overlooked in favour of genes that are highly represented in the literature.
Furthermore, we demonstrate that DeSciDe is widely applicable to omics analysis by reanalysing publicly available datasets from RNA-seq, global proteomics, CRISPR screens, and ATAC-seq experiments. A recent RNA-seq experiment published by Ma et al. was deployed to study differential splicing in the context of TDP-43 deficient FTD-ALS.21 The authors chose the mRNA UNC13A for follow-up studies, demonstrating its role in ALS pathology (Fig. 5A). Plotting connectivity vs. precedence using the search terms RNA-splicing, ALS, and TDP-43 suggested UNC13A as the highest confidence hit (most connected and most precedented), illustrating that this bioinformatics methodology can recapitulate complex data analyses without prior knowledge of the field (Fig. 5B). Further, based on the scatterplot, we can suggest alternative genes for investigation that either cluster around UNC13A or have less precedented associations with the search terms, which may not be obvious candidates for follow-up studies (Fig. 5C).
Fig. 5 Analysis of RNA-seq data using DeSciDe. (A) Brief outline of RNA-seq experiment performed in the study by Ma et al.21 (B) DeSciDe analysis suggests UNC13A as the highest confidence hit, recapitulating the authors' analysis in an unbiased manner. (C) DeSciDe analysis can suggest alternate genes for further analysis based on novelty or confidence. (D) Aim of proteomics and transcriptomics experiments re-examined using DeSciDe. (E) Both proteomics and transcriptomics analyses via DeSciDe arrive at the same conclusion as the authors. Many high confidence genes remain unexplored within both studies.22 Graphs made in Prism from exported DeSciDe data.
Finally, we analysed a 2020 study on the proteomic and transcriptomic host response to COVID-19.22 One key aim of these experiments was to identify factors that contribute to fatality following infection (Fig. 5D). From 4065 and 637 differentially expressed genes across RNA-seq and proteomic datasets, respectively, the authors highlighted expression of cathepsins as a marker for poor prognosis. DeSciDe analysis also points towards CTSB and CTSL as high confidence hits in the quadrant closest to the origin in both RNA-seq and proteomics datasets (Fig. 5E). Once again, these data suggest that analysis through the DeSciDe pipeline, using connectivity and literature datamining to rank hits, is a viable and useful bioinformatics method for unbiased analysis of gene lists.
While this method of data analysis appears to be powerful for ranking genes in an unbiased way, some deficiencies remain. Connectivity is based upon and intrinsically related to precedence, as more connections have been reported for more “popular” genes, so completely uncharacterized genes will still be ignored using this analysis. Further, the search function within this application cannot filter for the most relevant journal articles, only those that contain the gene name and the manually selected search terms, so some articles may not actually show a meaningful connection. The selection of the search terms used can produce different results, so there is a burden on the user to be thoughtful about which terms they wish to employ when running DeSciDe. Additionally, this platform operates best with gene lists of more than 20 and fewer than 500 genes, as small datasets result in limited hits in the quadrants of interest, and large datasets may still produce hundreds of genes in the regions of interest, decreasing the utility of the tool to narrow down hits. Therefore, curation of the omics dataset to include a list of the top statistically significant hits is important to gain meaningful insight from DeSciDe. As with any biological investigation, interpretation of data falls to the researcher. DeSciDe can help guide a researcher towards genes of interest, but ultimately they will need to investigate and validate the hits experimentally. Finally, these analyses do not include enrichment fold change, which is often a metric used for gene selection. We are currently working on methods to improve the analysis pipeline to solve some of these issues.
Code can be accessed at https://github.com/camdouglas/DeSciDe.
This journal is © The Royal Society of Chemistry 2025