Sreejata Dutta a, Dinesh Pal Mudaranthakam ab, Yanming Li ab and Mihaela E. Sardiu *abc
a Department of Biostatistics & Data Science, University of Kansas Medical Center, Kansas City, Kansas, USA. E-mail: msardiu@kumc.edu
b University of Kansas Cancer Center, Kansas City, USA
c Kansas Institute for Precision Medicine, University of Kansas Medical Center, Kansas City, Kansas, USA
First published on 16th April 2024
Omics data sets often pose a computational challenge due to their high dimensionality, large size, and non-linear structures. Analyzing these data sets becomes especially daunting in the presence of rare events. Machine learning (ML) methods have gained traction for analyzing rare events, yet there has been limited exploration of bioinformatics tools that integrate ML techniques to comprehend the underlying biology. Expanding upon our previously developed computational framework of an integrative machine learning approach, we introduce PerSEveML, an interactive web-based tool that uses crowd-sourced intelligence to predict rare events and determine feature selection structures. PerSEveML provides a comprehensive overview of the integrative approach through evaluation metrics that help users understand the contribution of individual ML methods to the prediction process. Additionally, PerSEveML calculates entropy and rank scores, which visually organize input features into a persistent structure of selected, unselected, and fluctuating categories that help researchers uncover meaningful hypotheses regarding the underlying biology. We have evaluated PerSEveML on three diverse biologically complex data sets with extremely rare events from small to large scale and have demonstrated its ability to generate valid hypotheses. PerSEveML is available at https://biostats-shinyr.kumc.edu/PerSEveML/ and https://github.com/sreejatadutta/PerSEveML.
ML methods excel at discovering concealed patterns within complex data sets without extensive human involvement, making them invaluable in fields like omics data analysis. Their scalability and ability to process vast amounts of information, especially in high-throughput technologies, enhance their appeal.6 Unlike traditional methods relying on predefined assumptions, ML models learn directly from data, capturing intricate relationships and patterns that might be overlooked. However, analyzing rare events in omics data through ML poses challenges.2 Most ML algorithms struggle with imbalanced data, focusing on the majority class and overlooking patterns in rare events (minority class). ML models require sufficient data to learn meaningful patterns, yet in omics data sets, rare events like specific mutations or low-abundance molecules are often outnumbered by common events.7
Rare events in cancer-genomic studies refer to disease outcomes that occur infrequently compared to the controls, for instance rare cancers such as gallbladder cancer and hairy cell leukemia. In quantitative trait studies, rare events could refer to the expression status of rarely expressed genes or low-abundance proteins.8,9 Rarely expressed genes and rare alternative splicing transcripts can provide insights into unique biological processes,10,11 low-abundance proteins may indicate specialized biological functions,12,13 and rare post-translational modifications (PTMs) can play critical roles in cellular signaling pathways.14,15 Analyzing rare events in omics data using ML methods can present several complications because these events are scarce.
Past research suggested various approaches to deal with the problem of class imbalance, which include data-level, algorithm-based, and hybrid methods.16,17 The data-level approaches involve under-sampling the majority class or over-sampling the minority class before training the model, which is an additional step in data preprocessing. Some of the well-known sampling techniques include ADASYN (adaptive synthetic sampling)18 and SMOTE (synthetic minority over-sampling technique).19 Both SMOTE and ADASYN generate synthetic samples for the minority class. SMOTE creates synthetic samples along line segments between existing minority class instances, while ADASYN adjusts the synthetic sample generation based on the density of the minority class.
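The interpolation idea behind SMOTE can be illustrated with a minimal sketch. This is not the implementation used by PerSEveML or by any particular library; the function name `smote_like_samples`, its parameters, and the brute-force neighbor search are assumptions made for illustration only.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=5, seed=0):
    """Create synthetic minority samples on line segments between
    a minority point and one of its k nearest minority neighbors.
    Hypothetical helper; requires at least two minority points."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)  # cannot have more neighbors than other points

    # Pairwise Euclidean distances among minority points (brute force).
    dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base point and one of its neighbors for each new sample.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]

    # Interpolate at a random position along the connecting segment.
    gap = rng.random((n_new, 1))
    return minority[base] + gap * (minority[chosen] - minority[base])
```

ADASYN follows the same interpolation step but, in addition, draws more synthetic samples around minority points that are harder to learn; that weighting is not shown here.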
Algorithm-based techniques leverage robust algorithms to address class imbalance, employing methods such as cost-sensitive frameworks, where a higher penalty for misclassification is applied to the minority class for enhanced classification performance. These techniques also encompass optimizing hyperparameters through cross-validation, a procedure involving training models on subsets of the training set and evaluating them on unseen data subsets. Another approach to handling class imbalance is the hybrid method, a combination of data-level and algorithm-based approaches. Notable hybrid methods include SMOTEBoost20 and RUSBoost.21 Although all these approaches are comparably effective, algorithm-based techniques are favored by ML practitioners due to their simplicity and systematic enhancement of ML performance. However, they come at the cost of significantly increased training time for ML models.
Several analytical tools have been developed in recent years for high-dimensional omics data, providing simplicity of implementation and results in a comprehensible format.22–26 For instance, HTPmod,22 introduced in 2018, is a web-based shiny application offering various ML methods and visualization choices for high-dimensional data sets. In 2021, multiSLIDE23 was introduced, enabling the visualization of interconnected features in omics data sets and aiding biologists in understanding underlying biological relationships. Enrichr-KG,25 developed in 2023, enhances enrichment analysis and visualization using knowledge graphs, serving as a valuable resource for gene enrichment analysis. Despite the availability of these advanced analysis tools, there remains a need for ML tools that specifically address the computational challenges associated with rare events and corresponding visualization techniques that can aid researchers in formulating meaningful hypotheses.
Analyzing rare events demands meticulous data preprocessing, thoughtful algorithm selection, and rigorous validation methods to ensure reliable results. However, analyzing rare events poses computational challenges due to limited data availability for the minority class or ML methods being overwhelmed by the majority class. To address this problem of rare data analysis, we have created an interactive tool called PerSEveML. PerSEveML allows users to predict rare events and visualize the contribution of input features to these predictions. Fig. 1 represents the tool's functioning and computational framework.
PerSEveML addresses common challenges in analyzing omics data sets, offering twelve ML methods suitable for small and large data sets. PerSEveML uses normalization techniques to handle non-linear data structures before training ML models. Six different normalization techniques have been integrated into the interface to ensure a wide application of this tool. These normalization techniques include log transformation for skewed data, standardization for feature scaling, and TopS, a normalization based on topological scoring, which effectively accentuates extreme data points for omics data with rare events.27,28 To comprehend the effect of normalization, ML practitioners often rely on data visualization tools such as boxplots. Therefore, we have incorporated box plots into the PerSEveML interface to visualize post-normalization data distribution.
PerSEveML is a versatile tool for various classification problems, uniquely capable of handling rare events through an integrated ML approach. While other ML toolkits like SuperLearner24 and HTPmod22 have attempted to tackle classification problems using multiple ML algorithms, these methods depend only on one best-performing model for feature selection. Our goal in adopting an integrative approach was to capitalize on the learning abilities of all top-performing models. Each ML algorithm is influenced by various factors, including decision boundaries, cost functions, sampling models, and hyperparameters, which suggests that different models may identify distinct features that contribute to predicting rare events. Decision trees, for example, exhibit high variance with low bias, while models like logistic regression or linear discriminant analysis (LDA) have higher bias but lower variance. Proper training of each model is crucial to prevent underfitting or overfitting, considering the impact of bias and variance on ML performance.
PerSEveML is specifically tailored for complex biological data, such as large protein complex networks with multiple modules and shared subunits, where every feature holds biological significance. The tool allows users to assess, compare, and download the performance of the integrative ML approach with individual models using evaluation metrics. PerSEveML employs cross-validation for each selected ML model, enabling users to specify the number of folds, k, for cross-validation. Cross-validation evaluates a model's generalization by dividing data into training and validation subsets, ensuring reliable assessment across various partitions. Cross-validation serves two main objectives: optimizing model hyperparameters to prevent overfitting, and improving model performance for reliable predictions on unseen data.
In scenarios involving rare events, past research has employed ML methods with cross-validation.29,30 However, for cross-validation to work on rare event prediction, adequate information on the minority class needs to be present. Thus, for scenarios where analysts require a more sophisticated solution to deal with class imbalance, they can choose SMOTE or ADASYN. The PerSEveML interface integrates the SMOTE and ADASYN techniques, enabling users to incorporate these resampling methods into their data analysis.
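The dependence of cross-validation on minority-class information can be seen in a small sketch that assigns observations to k folds class by class. The function `stratified_folds` is hypothetical, and stratifying by class is an assumption made for illustration; it is not a statement about how PerSEveML builds its folds.

```python
import numpy as np

def stratified_folds(labels, k=5, seed=0):
    """Assign each observation to one of k folds, spreading every class
    as evenly as possible across folds. Hypothetical helper."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    fold = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        # Shuffle the indices of this class, then deal them out round-robin.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        fold[idx] = np.arange(len(idx)) % k
    return fold

# Example: 3 rare events among 100 observations, 5 folds.
labels = np.array([1] * 3 + [0] * 97)
fold = stratified_folds(labels, k=5)
rare_per_fold = np.bincount(fold[labels == 1], minlength=5)
```

With only three rare events and five folds, two folds contain no rare event at all, so a model validated on those folds cannot be assessed on the minority class. This is the situation in which resampling before training becomes attractive.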
The PerSEveML interface allows users to visualize the correlation between input features and the persistent feature structure created using the integrative ML approach. Dutta et al.31 introduced the utilization of cut-point analysis to combine feature importance derived from diverse ML methods by utilizing entropy and rank scores to formulate a persistent feature structure. The determination of the persistent feature structure relies significantly on cut-point analysis, with the cut-off point defined as the percentage of features hypothesized by users to encapsulate the utmost information about the rare event of interest. PerSEveML offers users the chance to change the cut-off, thus allowing them to select an optimal cut-off point that works for their data set. The proposed feature structure is segregated into persistently selected, fluctuating, and unselected categories. These three categories can be used to select important features from the selected categories or generate a hypothesis using the fluctuating category to understand the association of a weak signaling feature with the rare event being studied. Moreover, the feature structure can serve as a metric for feature reduction. This involves excluding features from the unselected category, facilitating further experiments during the exploratory stage. The users are provided with the option to download the entropy and rank scores, alongside the persistent feature structure for further analysis.
We highlight the capabilities of PerSEveML by presenting three examples that utilize multi-omics data sets. These data sets vary in size and rarity. The first data set is from a study of polychromatic flow cytometry on the rare population of human hematopoietic stem cells (HSCs).32 The cells are derived from human bone marrow cells from a single healthy donor. This data set has 44,140 data points and utilizes expression levels from thirteen (13) surface protein biomarkers to determine the presence or absence of HSCs. The second data set is from a high-dimensional flow cytometry and mass cytometry (CyTOF) study on a rare population of activated (cytokine-producing) memory CD4 T cells.33 The cells in this data set are derived from human peripheral blood cells exposed to influenza antigens. To determine the presence and absence of T cells, this data set utilizes expression levels from fourteen (14) biomarkers and has 396,460 data points. The third data set evaluated on PerSEveML consists of proteomics data from Adams et al.34 focusing on SIN3/HDAC complexes. In this data set, the bait proteins are the features, and the prey proteins are listed in rows. The significance of bait proteins in the complex prediction of SIN3/HDAC complexes has been assessed through protein expression analysis and profiling of the interaction networks of SIN3/HDAC subunits. In summary, we demonstrated the capabilities of PerSEveML as a web tool that simplifies omics data analysis for data sets of different sizes with rare events, and enhances the understanding of biological systems.
1. Hyperbolic arcsine transformation: Hyperbolic arcsine transformation (with cofactor) is specifically designed for cytometry data to allow linearity around zero.35 Users can adjust the cofactor, although the default value is set at 150 based on previous studies.32,36 PerSEveML also allows users to perform regular arcsine transformation.
2. TopS normalization: PerSEveML features topological scoring (TopS), a normalization method that is suitable for multi-omics data.27,28 TopS accentuates extreme data points, thereby aiding the segregation of rare cell populations from abundant cells. TopS can effectively reduce the number of clusters for rare events, making it effective for analyzing biologically complex omics data sets. Let T_ij be the normalized value of the ith biomarker of the jth observation and Q_ij be the expression level of the ith biomarker of the jth observation. Then, TopS can be mathematically described by eqn (1).
(1) |
3. Percentage row normalization: This normalization is expected to work similarly to TopS but is designed specifically for proteomics data sets. The percentage row normalization can be defined by eqn (2).
(2) |
4. Log transformation: The log transformation is particularly beneficial when dealing with skewed data, as it has the capability to render transformed features resembling a Gaussian distribution.37 In the context of PerSEveML, users are granted the flexibility to introduce a constant (≥ 0.001) to their data points, ensuring the log transformation does not yield null values; thereby preserving the integrity of the analysis. This adjustment is also crucial for proper functioning within PerSEveML. Furthermore, log transformation finds application in cytometry data sets, especially when confronted with higher positive and negative intensities commonly observed in high-density multicolor flow cytometry (MFC).35
5. Min–max scaling: Min–max normalization scales the features between zero (0) and one (1).37,38 It is a common preprocessing technique among ML practitioners. The mathematical formulation of min–max normalization can be described by eqn (3).
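$$T_{ij} = \frac{Q_{ij} - \min_{j} Q_{ij}}{\max_{j} Q_{ij} - \min_{j} Q_{ij}} \tag{3}$$
where the minimum and maximum are taken over the observations j of the ith biomarker.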
6. Standard scaling or standardization: Standardization, also called z-score normalization, rescales individual features to have a mean of zero (0) and a standard deviation of one (1).39 However, this type of normalization fails when the features are skewed.38 PerSEveML includes standardization as one of its normalization options.37,38 Eqn (4) defines the z-score normalization mathematically.
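$$T_{ij} = \frac{Q_{ij} - \mu_{i}}{\sigma_{i}} \tag{4}$$
where μ_i and σ_i are the mean and standard deviation of the ith biomarker.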
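Three of the transformations in the list above can be sketched in a few lines. The sketch below is illustrative only and rests on stated assumptions: the hyperbolic arcsine is taken as arcsinh(x / cofactor), which is the usual cytometry convention; the log transformation uses the natural logarithm; and the TopS function follows the definition we understand from the cited TopS papers,27,28 namely Q_ij ln(Q_ij / E_ij) with E_ij = (row sum × column sum) / table sum. None of the function names come from PerSEveML.

```python
import numpy as np

def arcsinh_transform(x, cofactor=150.0):
    """Hyperbolic arcsine with a cofactor; near-linear around zero."""
    return np.arcsinh(np.asarray(x, dtype=float) / cofactor)

def log_transform(x, constant=0.001):
    """Natural log after adding a small constant so zeros stay finite.
    The base of the logarithm is an assumption."""
    return np.log(np.asarray(x, dtype=float) + constant)

def tops(q):
    """Topological score for a non-negative matrix q (assumed definition):
    q_ij * ln(q_ij / expected_ij), where expected_ij is the product of the
    row and column sums divided by the grand total. Zero entries score 0."""
    q = np.asarray(q, dtype=float)
    row = q.sum(axis=1, keepdims=True)
    col = q.sum(axis=0, keepdims=True)
    expected = row * col / q.sum()
    out = np.zeros_like(q)
    mask = q > 0  # avoid log(0); rows/columns of a positive entry sum to > 0
    out[mask] = q[mask] * np.log(q[mask] / expected[mask])
    return out
```

Because the TopS score uses row, column, and table totals, it depends on the whole matrix rather than on per-feature statistics, which is consistent with the tool treating TopS differently from min–max scaling and standardization.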
Users can choose not to normalize their data if it is already normalized or if normalization is considered inappropriate for the data set. The selected normalization method is applied in the data preprocessing stage only after the data has been divided into training and test sets, with the exception of TopS and percentage row normalization. PerSEveML estimates the mean and standard deviation (for standardization) and the minimum and maximum values (for min–max normalization) from the training set and uses them to normalize the test set. This prevents information from the test set from leaking into training, making ML models less prone to overfitting. Additionally, PerSEveML provides correlation plots and the option to download the correlation matrix for further analysis. These plots are valuable for comprehending the distribution of individual features within a data set and understanding the impact of the chosen normalization method. We further advise users to impute missing values externally, for example with KNNImpute, before uploading the data set into PerSEveML.
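The practice of estimating normalization parameters on the training set and reusing them on the test set can be sketched as follows. The helper names are hypothetical, and the use of the sample standard deviation (ddof=1) is an assumption.

```python
import numpy as np

def fit_standardizer(train):
    """Column-wise mean and sample standard deviation of the training set."""
    train = np.asarray(train, dtype=float)
    return train.mean(axis=0), train.std(axis=0, ddof=1)

def fit_minmax(train):
    """Column-wise minimum and maximum of the training set."""
    train = np.asarray(train, dtype=float)
    return train.min(axis=0), train.max(axis=0)

def apply_standardizer(x, mean, sd):
    # Constant features (sd == 0) would divide by zero; not handled here.
    return (np.asarray(x, dtype=float) - mean) / sd

def apply_minmax(x, lo, hi):
    # Test values outside the training range fall outside [0, 1].
    return (np.asarray(x, dtype=float) - lo) / (hi - lo)

train = np.array([[0.0], [2.0], [4.0]])
test = np.array([[6.0]])

mean, sd = fit_standardizer(train)
lo, hi = fit_minmax(train)
z_test = apply_standardizer(test, mean, sd)
mm_test = apply_minmax(test, lo, hi)
```

In this toy example the test value lies above the training maximum, so its min–max value exceeds one. That is expected when the parameters come from the training set alone.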
Upon training the user-selected models, the model performances can be evaluated based on evaluation metrics such as sensitivity, specificity, accuracy, kappa, and ROC–AUC. In addition, PerSEveML internally performs a voting classification on the test set, assigning each observation the class predicted by the largest number of selected models, and thereby constructs an integrative prediction that incorporates predictions from all selected models. This prediction is also compared to the observed classes on the test set. If all the selected models perform well, the integrative ML shows performance metrics closer to one, while the performance of the integrative ML model drops when one or more models do not perform well. PerSEveML consolidates the performance metrics of the chosen ML methods into a unified table, presenting a comprehensive overview that includes both individual ML methods and the integrated ML model. Additionally, users can download this consolidated table for in-depth analysis.
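A majority vote over model predictions, together with two of the evaluation metrics named above, can be sketched as follows. This is a minimal illustration for binary labels coded 0/1; the function names and the tie-breaking rule (ties go to class 0) are assumptions, not details taken from PerSEveML.

```python
import numpy as np

def majority_vote(predictions):
    """predictions: array of shape (n_models, n_observations) with 0/1 labels.
    Returns the class predicted by more than half of the models.
    Ties are resolved toward class 0 (an assumption)."""
    preds = np.asarray(predictions)
    votes_for_one = preds.sum(axis=0)
    return (2 * votes_for_one > preds.shape[0]).astype(int)

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity (true positive rate) and specificity (true negative rate).
    Assumes both classes are present in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)

# Three models, three test observations.
model_preds = [[1, 0, 1],
               [1, 1, 0],
               [0, 0, 1]]
integrative = majority_vote(model_preds)
```

The sketch also makes the dependence on individual models visible: a single poorly performing model among a small number of voters can flip the vote for many observations.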
(5) |
(6) |
The persistent feature structure serves as a feature selection method. The persistently selected category contains the features that emerged as important in most of the ML algorithms, suggesting that these features provide constructive information for rare event prediction. The persistently unselected category contains features that provided minimal information regarding the rare event across the different ML methods, which gives researchers a plausible rationale for feature reduction. However, the most interesting category is the group of fluctuating features. These features are pivotal in predicting rare events for certain methods but do not emerge as significant predictors for others. This variation suggests that the complexity of biological processes within these data sets leads different ML models to capture distinct patterns based on their computational algorithm and decision boundaries. Hence, these features provide significant hypotheses for future testing.
The application defaults to 40% as the cut-off for cut-point analysis. After testing various data sets within PerSEveML, we found that the optimal range for the cut-off is between 40% and 60%. Fig. 2 illustrates the working mechanism of the app.
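One plausible reading of cut-point analysis is sketched below. It is an illustration under explicit assumptions and does not reproduce the tool's exact entropy and rank-score formulas, which are not shown in this text. The assumptions are: each model ranks features by importance; a feature counts as selected by a model if it falls within the top cut-off fraction of that model's ranking; the entropy is the binary Shannon entropy of the fraction of models selecting the feature; and the rank score is the mean rank across models. All names are hypothetical.

```python
import numpy as np

def binary_entropy(p):
    """Shannon entropy (base 2) of a Bernoulli(p) variable; 0 at p = 0 or 1."""
    p = np.asarray(p, dtype=float)
    h = np.zeros_like(p)
    inside = (p > 0) & (p < 1)
    q = p[inside]
    h[inside] = -(q * np.log2(q) + (1 - q) * np.log2(1 - q))
    return h

def persistent_structure(importance, cutoff=0.4):
    """importance: array (n_models, n_features) of feature importances.
    Returns per-feature categories, entropies, and mean ranks
    under the assumptions stated in the lead-in."""
    imp = np.asarray(importance, dtype=float)
    n_models, n_features = imp.shape
    n_top = max(1, int(round(cutoff * n_features)))

    # Rank 1 = most important feature within each model.
    ranks = (-imp).argsort(axis=1).argsort(axis=1) + 1
    selected = ranks <= n_top

    # Fraction of models that place each feature above the cut-off.
    frac = selected.mean(axis=0)
    category = np.where(frac == 1, "selected",
                        np.where(frac == 0, "unselected", "fluctuating"))
    return {"category": category,
            "entropy": binary_entropy(frac),
            "mean_rank": ranks.mean(axis=0)}
```

Under this reading, an entropy of zero means that all models agree on a feature, and the rank information separates agreement on selection from agreement on non-selection. Raising the cut-off moves more features into the selected category, which is why the choice of cut-off affects the resulting structure.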
Using PerSEveML, boxplots revealed that TopS, arcsine transformation with a cofactor of 150, percentage row normalization, and standard scaling yielded good results when applied to an 80:20 train-test split. We utilized TopS normalization and hyperbolic arcsine transformation with a cofactor of 150 to facilitate performance comparison within PerSEveML. Tree-based algorithms, specifically XgBoost, demonstrated commendable performance on this data set, regardless of the normalization method employed. Non-tree-based models followed in performance, with linear models showing the least favorable results. To reach these conclusions, we assessed evaluation metrics such as AUC, sensitivity, specificity, and kappa.
The performance of the integrative ML approach was highly dependent on the individual models' performance. For instance, combining a linear model like logistic regression with XgBoost negatively impacted the integrative ML's performance. However, combining XgBoost with another tree-based model, such as AdaBoost, yielded significantly better results. The feature selection process displayed variations when altering parameters like the train-test split percentage, “k” value for k-fold cross-validation, and the cut-off for cut-point analysis. However, CD90bio and CD45RA consistently appeared in the selected feature category, while CD11b and CD123 consistently fell into the unselected category. These findings align with our existing knowledge of HSCs and suggest that the combination of CD90bio, CD34, and CD45RA can reliably identify the presence of HSCs in human bone marrow.45,46 Furthermore, we noted that the majority of the biomarkers for Nilsson rare remained within similar persistent categories, as illustrated by Dutta et al.31 Fig. 3a presents the persistent biomarker structure obtained on the ADASYN-resampled Nilsson rare data set using an 80:20 split, TopS normalization, 5-fold cross-validation, a 40% cut-off for cut-point analysis, and three ML models (XgBoost, naïve Bayes, and LDA).
In the Mosmann rare data set, one hundred and nine (109) cells were identified as activated memory CD4 T cells, highlighting an extreme class imbalance with a rarity of 0.03%. Prior research has underscored the crucial role of signaling biomarkers in identifying rare events,47–49 with biomarkers such as CD69 proving invaluable in identifying T lymphocytes and natural killer (NK) cells.50
The application of PerSEveML to the Mosmann rare data set revealed that five (5) of the six (6) normalization techniques yielded satisfactory results, with the sole exception being the log transformation. Similar to the Nilsson rare data set, our focus centered on TopS, percentage row normalization, and the hyperbolic arcsine transformation with a cofactor of 150 as normalization techniques. Notably, XgBoost consistently demonstrated superior performance compared to all other models.
In general, tree-based models outperformed non-tree-based or linear models. Throughout our iterations, it became evident that signaling biomarkers exhibited superior predictive capabilities compared to surface protein biomarkers. Signaling biomarkers such as IFNg, CXCR5, and TNFa consistently stood out as members of the selected category, while CD4 and GZB.SA often found themselves in the unselected category. This underscores the robust performance of PerSEveML in elucidating the underlying biology, as previously suggested by past researchers. The persistent biomarker structure using 80:20 split, percentage row normalization, 5-fold cross-validation, and 40% cut-off for cut-point analysis on the Mosmann data set using two ML methods (XgBoost and decision tree) is presented in Fig. 3b. To investigate the impact of SMOTE application on the persistent feature structure, we employed SMOTE with percentage row normalization, while utilizing 80:20 split, 5-fold cross-validation, and a 40% cut-off for cut-point analysis on the Mosmann data set. We selected XgBoost and naïve Bayes as our preferred ML methods based on their predictive performance. The ESI† Fig. S1 illustrates the feature structure. While the majority of the persistent feature structure remained similar across the two iterations, we observed differences for biomarkers CD3, CD69, IL2, and IL5, as they transitioned between persistently selected and unselected categories.
The task of predicting protein complexes with ML is a complex and challenging one in the fields of bioinformatics and computational biology. Protein complexes are important for various cellular processes, and knowing their composition can provide valuable insights into the functioning of biological systems. However, despite numerous attempts, identifying which human proteins exist in protein complexes and how they are organized on a proteome-wide scale remains challenging.51 Recently, ML approaches such as deep learning have been recognized for their potential to predict protein complexes from protein abundances.51 SIN3/HDAC contains seven homologous pairs: SAP30/SAP30-LIKE, ING1/ING2 (1-like), BRMS1/BRMS1-LIKE, RBBP4/RBBP7, HDAC1/HDAC2, SIN3A/SIN3B, and ARID4A/ARID4B.34
Adams et al.34 showed that the proteins within each homologous pair are mutually exclusive. Additionally, there are two distinct forms of SIN3 complexes in S. cerevisiae: RPD3L (SIN3 large) and RPD3S (SIN3 small). Higher eukaryote genes encode proteins similar to components of the SIN3 complex in S. cerevisiae. In humans, there are proteins like HDAC1/HDAC2, SIN3A/SIN3B, and RBBP4/RBBP7 that have similarities to the core SIN3 complex components RPD3, SIN3, and UME1 in S. cerevisiae. Additionally, humans have proteins similar to components specific to RPD3L and RPD3S. For example, SUDS3/BRMS1/BRMS1L, SAP30/SAP30L, and ING1/ING2 have similarities to RPD3L-specific components SDS3, SAP30, and Pho23, respectively. Within RPD3S, components like Rco1 and Eaf3 have similarities to human PHF12 and MORF4L1, respectively. This organization of the SIN3/HDAC complex highlights its complexity, making it an excellent system for ML analysis.
It can be difficult to differentiate between persistently selected and unselected baits when predicting SIN3/HDAC subunits. We experimented with various ML techniques and normalization methods to overcome this challenge. Our findings reveal that we can effectively distinguish between baits by using TopS alongside XgBoost and naïve Bayes while setting an 80:20 train-test split, cut-point of 30% and 4-fold cross-validation (Fig. 3c). As illustrated in Fig. 3c, we successfully separated mutually exclusive pairs within our data. For instance, ARID4B was persistently unselected while ARID4A was selected. Similarly, BRMS1L and SAP130 were persistently unselected while BRMS1 was selected. For the pair SIN3A/SIN3B, SIN3A was in the fluctuating group along with SAP30 in the SAP30/SAP30L pair. Although the ING1/ING2 pair is traditionally considered mutually exclusive, our data set includes both proteins in the purifications, explaining their presence in persistently selected features. Based on the criteria mentioned earlier, it was discovered that one of the subunits of the large complex, identified as SAP130, was not selected in the persistent group. This particular subunit could not pull down some subunits that compose the SIN3/HDAC complex, distinguishing it from other baits and placing it in the persistently unselected group. In the case of the small complex, the MORF4L1 bait was separated from the PHF12 bait in the persistently unselected group, as it pulled lower abundance proteins overall compared to PHF12 bait.
Thus, these results show that our ML approach applied to protein abundances can reveal hidden features for protein complex prediction that are not easy to detect without prior knowledge.
To demonstrate the robustness of PerSEveML in modeling and visualizing results from a crowd-sourced intelligence ML approach, we used three data sets varying in size, number of biomarkers, and rarity percentage. Each data set had unique complexities due to differences in the distribution of rare events across biomarkers. However, by normalizing the data and utilizing high-performing ML methods, we drew informative conclusions on the predictive properties of biomarkers using the persistent feature structure. PerSEveML's use of entropy and rank scores allows for less rigid results than feature importance generated by individual models. Compared with similar tools such as SuperLearner24 and HTPmod,22 PerSEveML offers a more robust feature selection method: it not only uses multiple ML methods for prediction but also combines the pattern-recognition strengths of these methods to perform feature selection through the persistent feature structure.
One of the focal points of PerSEveML is the persistent feature structure. Therefore, it is crucial for users to understand the implications of this structure and its pivotal role in understanding the underlying biology. It is worth noting that certain ML methods, such as LDA and logistic regression, capture linear associations among features, while tree-based algorithms, naïve Bayes, and SVM (employing non-linear basis functions like polynomial kernel or radial basis function) consider non-linear decision boundaries. As a result, the overall persistent feature structure may comprise a combination of these linear and non-linear decision boundaries, essentially encapsulating different facets of the data. Furthermore, in situations where the feature structure is utilized to generate hypotheses for future experiments or for feature selection during the exploratory stage, users must recognize that features in the persistently fluctuating category possess some signals necessary for rare event prediction; however, due to the structure of the decision boundary of various algorithms, these features may not exhibit signals as robust as those in the persistently selected category. Thus, it is important to consider the features from the fluctuating category alongside the features from the selected categories to avoid misleading conclusions.
PerSEveML automates data analysis with a point-and-click interface. The application requires users to input analysis-ready data sets organized with observations in rows and features in columns, and does not offer preprocessing functionality such as handling missing data. We suggest that users impute missing values with a method of their choice, such as KNNImpute, prior to uploading the data set into PerSEveML. Users must also select ML methods that work best for their data sets and assess multicollinearity prior to training final models to avoid drawing invalid conclusions. Understanding the model performances and deciding whether to include individual ML models in the final analysis is at the discretion of the user.
Even though PerSEveML was built on our previous work, Dutta et al.,31 its computational framework focuses on faster computation and easy ML implementation in analyzing rare events. For instance, we decided to work extensively with a single normalization method. Even though implementing TopS and percentage row normalization is time-consuming for very large data sets, we decided to include them in the application since these normalization techniques are extremely important when working with omics data sets. In addition, unlike Dutta et al.,31 we have not included KNN as a part of PerSEveML for two reasons: first, during the app development stage, we found that KNN takes significantly longer to tune parameters; second, the algorithm did not show optimal performance in any of our test cases. Another deviation from the original work is related to the two methods of calculating feature importance: one via the inbuilt feature importance method from the caret package, and the other using the stepwise ROC method. To enhance the user experience of our application, we have removed the stepwise ROC analysis feature from PerSEveML due to its considerable computational demands.
As demonstrated through the examples of Nilsson and Mosmann rare data sets, we found that PerSEveML could capture all the major findings from past articles.31 Users can leverage PerSEveML in combination with different ML methods and various normalization techniques to uncover hidden patterns. For users keen on exploring the original computational framework with two normalizations, PerSEveML offers easy access to entropy and rank scores. By readily downloading these scores, users can seamlessly implement the original author's approach31 and discover a more robust version of the persistent biomarker structure.
Our study affirms the robustness of PerSEveML in identifying relevant biomarker structures and detecting subtle shifts in their categorization. Additionally, it showcases PerSEveML's ability to analyze intricate data structures such as stem cells that belong to multiple clustering groups and protein complexes consisting of modules with shared subunits and mutually exclusive pairs. By refining and expanding our understanding of these persistent features, PerSEveML stands as a valuable tool for unraveling the complexities of biomarker-driven phenomena in various domains of network-based research.
We envision that additional applications will be added to the PerSEveML app in the near future. These include perturbation data prediction, disease survival outcome prediction based on omics data sets, and neuroimaging data with genomics profiles.
The PerSEveML tool is accessible for free at https://biostats-shinyr.kumc.edu/PerSEveML/. For handling larger data sets, we recommend downloading the application from GitHub (https://github.com/sreejatadutta/PerSEveML) and running it locally on your system. Note that performing resampling methods on large data sets requires more computation time.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4mo00008k
This journal is © The Royal Society of Chemistry 2024