Carl M. Kobel,a Jenny Merkesvik,b Idun Maria Tokvam Burgos,c Wanxin Lai,b Ove Øyås,a Phillip B. Pope,abd Torgeir R. Hvidsten b and Velma T. E. Aho *a
aFaculty of Biosciences, Norwegian University of Life Sciences, Ås, Norway. E-mail: velma.tea.essi.aho@nmbu.no
bFaculty of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, Ås, Norway
cFaculty of Natural Sciences, Norwegian University of Science and Technology, Trondheim, Norway
dCentre for Microbiome Research, School of Biomedical Sciences, Queensland University of Technology (QUT), Translational Research Institute, Woolloongabba, Queensland, Australia
First published on 4th July 2024
Holo-omics is the use of omics data to study a host and its inherent microbiomes – a biological system known as a “holobiont”. A microbiome that exists in such a space often encounters habitat stability and in return provides metabolic capacities that can benefit its host. Here we present an overview of beneficial host–microbiome systems and propose and discuss several methodological frameworks that can be used to investigate the intricacies of the many as yet undefined host–microbiome interactions that influence holobiont homeostasis. While this is an emerging field, we anticipate that ongoing methodological advancements will enhance the biological resolution necessary to improve our understanding of host–microbiome interplay and to make meaningful interpretations and biotechnological applications.
Term | Definition |
---|---|
Habitat | A defined ecological niche that provides environmental parameters that support a set of organisms. |
Holo- | From Ancient Greek ὄλoϛ: hólos, “whole”. |
Holo-omics | Research that analyses one or more functional layers of omics data from both host and microbiome. The terms holo-omics and hologenomics might be used interchangeably because most omics layers arise from genomic DNA. |
Holobiont | An ecological unit consisting of a host and its resident, interacting micro-organisms. |
Host–microbiome interface | Any surface where biological features from either host or microbiome can interact. |
Integrative analysis | Overlapping or relating the biological factors between two molecular layers or host–microbiome sources. |
Metagenomics | Techniques used to study the collective genomic reads from all organisms in an ecological niche. |
Multi-omics | Research covering more than one omics layer representing one or multiple interacting organisms. Examples of the former include human multi-omics with measurements that only reflect human biology; and microbial multi-omics without taking the host into account. |
Omics | The study of all biomolecules of a specific type. This review focuses on functional omics data, which can be defined as omics data that change over time and across conditions. |
Proteomics | Measuring the abundance of proteins in a sample by matching mass spectra against a bespoke database derived from in silico translation of the genomic sequences. |
Transcriptomics | Techniques used to study an organism's transcriptome, i.e. the sum of all of its RNA transcripts. |
Untargeted metabolomics | Using methods such as mass spectrometry (MS) or nuclear magnetic resonance (NMR) to measure the abundance of all the metabolites in a sample. |
Fig. 1 Holo-omics is a specialised case of multi-omics where biological features are linked across a host–microbiome interface. (A) This interface is idealised along the horizontal axis labelled “holo-omic” as an epithelium with a large surface area where biochemical compounds can be exchanged in both directions. The vertical axis labelled “multi-omic” highlights that interactions can occur on multiple levels in terms of coding sequences and biochemical compounds. (B) Examples of molecular interactions across a host–microbiome interface.4–6 Created with biorender.com.
Acquiring a dataset to study host–microbiome interactions is a matter of applying various omics technologies to measure the molecular features of both sides of the holobiont. While this data acquisition used to be the limiting step in such analyses, modern molecular biology tools are making this process more efficient and economical. Today's primary technical bottlenecks are (1) overcoming microbial community complexity, which can contain thousands of different genomes compared to their singular defined host, and (2) the computational analysis of holo-omic data so that the biological processes of both the host and its microbiome can be integrated computationally, interpreted, and visualised.7 For example, performing data integration across the host–microbiome interface requires correlating individual biological features across various omics layers, which often cannot be scaled to the typical size of holo-omic datasets and can also suffer due to insufficient statistical power. To meet this challenge a new family of computational tools is needed: they must be able to cluster biological features into modules and cross-correlate features across the host–microbiome boundary, capturing the signals that represent the hypothesised cooperation between the host and its microbiome.
Symbiotic interactions in host-associated microbiomes are generally defined as mutualistic, commensalistic or neutral, depending on whether the benefit involved is two-way, one-way or lacking, respectively.8 Additionally, there is a spectrum of harmful to neutral interactions within the microbiome and between certain microorganisms, and opportunistic pathogens, viruses, and phages might play a role in defining the dynamics of the microbiome.9 Further layers of intra-microbiome complexity should also be considered, in particular the networks of symbioses within a given microbiome, which can be characterised in isolation, as for any other microbial environment. What distinguishes holo-omics is that host variation is integrated together with any intra-microbiome relationships. Holo-omics consequently makes it possible to understand intra-microbiome dynamics on which a host-directed interaction is imposed.
In this review, we discuss in detail how holo-omic analyses can be performed computationally and present several frameworks for taking the typically massive and complex holo-omic datasets and integrating the signal between the host and its microbiome. We consider host–microbiome studies where the host is a multicellular organism such as an animal, fungus, or plant that forms a large surface or boundary through which it can interact with a microbiome that typically consists of a community of single-celled microorganisms (bacteria, archaea, eukaryotes) and possibly viruses, with varying degrees of diversity (Table 2). For simplicity, we do not consider parasitic interactions in this review but focus on the beneficial interactions in holobionts.
Holobiont system | Symbiosis | Microbiome richness | Host → microbiome services | Microbiome → host services |
---|---|---|---|---|
Cattle rumen | Mutualistic8,10 | 8500–16 000 | Habitat, substrates15 | Catabolism of complex plant fibres,15 anabolism of essential chemicals |
Mouse gut | Mutualistic16 | 828–1573 species17,18 | Habitat, substrates | Catabolism of feed matter, anabolism of essential chemicals16,19 |
Salmon gut | Commensalistic | 30–40 species (prokaryotes)20 | Habitat, substrates | Unknown |
Plant root–soil | Mutualistic, commensalistic21 | 2799–271 000 | Energy (sugars, fibres)24 | Nutrients, nitrogen,25 stress resistance21 |
Bee gut | Mutualistic | <10 species26–29 | Habitat, substrates | Modulate social behaviour,29 catabolism of carbohydrates28 |
Many known hosts are obligate symbionts, meaning the host is non-viable when the microbiome is absent. One example of an obligate holobiont is lichen, in which a fungus and a community of cyanobacteria together form the holobiont: the fungus provides physical anchoring and nutrient assimilation, whereas the cyanobacteria provide carbohydrates assimilated through photosynthesis. Additionally, these holobionts may house Alphaproteobacteria that fix nitrogen for the lichen, which may otherwise be nutrient-limited.30 On the other end of the spectrum of dependency are several types of insects, such as ants and caterpillars, which harbour few or no resident microorganisms, and the ones present are unlikely to have a large impact on fitness.31 Mammalian hosts tend to fall between these two extremes: they are viable when raised in a germ-free setting, but experimental results suggest various abnormalities in such animals, ranging from changes in the immune system to altered neurodevelopment and behaviour.32,33
The microbiomes of holobionts are, by definition, not transmitted through the somatic genome of the host, which means that the microbiota must have their own means of transmission to offspring or between individuals in a population. The composition of species present in a microbiome is therefore subject to change over time as new species colonise and take over the functions of others.40 Host–microbiome co-evolution and adaptation become possible when new microbiota become part of the holobiont in a population of hosts and are inherited vertically by offspring or transmitted between individuals. This can give rise to endemic microbiota species that are exclusively found as part of a holobiont: the microorganisms adapt to their host and thus diverge from their ancestral population. Hosts and microbiota can co-adapt evolutionarily, meaning that each can specialise and optimise its function in the holobiont system over generations.41,42
Most frameworks are statistical in the sense that they test whether there are significant differences between treatments or co-appearing groups, but suitable mechanistic models are increasingly available and used for data integration as well.44 To integrate omics data, these mechanistic models should ideally account for the dynamics of all relevant genome-scale networks in the holobiont system, but scaling to systems of this size entails major computational challenges for dynamic models in particular.44 Because of this, mechanistic omics integration studies have mainly used genome-scale metabolic models (GEMs), which capture the steady-state flows of metabolites through an organism's network of biochemical reactions45 and are available for a range of hosts and microorganisms.46 By linking metabolic flows to interactions between host and microbiome, GEMs integrated with holo-omics can allow mechanistic investigation of holobiont systems. Dynamic modelling of genome-scale interaction networks is also becoming feasible thanks to algorithmic and computational advances,47 but most of the methods that we will discuss here take a statistical approach where they compare and compute significance between groups.
Recent holo-omic research articles provide examples of the different types of questions that can be approached from a holo-omics point of view, ranging from experiments with model organisms to comparative evolutionary studies. In the classic experimental end of the spectrum, two studies used a mouse model to address two “epidemics” faced by human medicine: opioid overuse48 and obesity.49 Both studies included host transcriptomics, microbial shotgun metagenomics, and untargeted metabolomics, the latter capturing a mix of molecules produced by the host and the microbiome. Their results suggested that the tested medications – morphine in the opioid study, the antidiabetic drug empagliflozin in the obesity study – had effects on the host and microbial layers.48,49 Both studies further confirmed that there are correlations between different omic layers, offering the simplest kind of evidence for host–microbiome interactions. The opioid study also tested this experimentally by showing that morphine-induced changes in host gene expression vary depending on the presence of a microbiome.48
In an example closer to traditional ecology, a study focusing on the gut of the termite Labiotermes labralis used metagenomics, metatranscriptomics, and host transcriptomics data to demonstrate that the host and the microbiome provide complementary sets of carbohydrate-active enzymes, enabling the holobiont to degrade a wide range of soil polysaccharides.50 Finally, a study taking a holo-omics approach to evolution compared several ant- and termite-eating mammals, with findings that supported convergent evolution not only in host genomes, but also in microbiomes.51 Specifically, the gut metagenomes of these mammals were enriched in enzymes that are necessary for subsisting on an insectivorous diet, such as chitinases and trehalases, compared to mammals with other types of diets.
While the existing publications showcase the exciting opportunities offered by holo-omic research, many of them include only one omics layer for each side of the holobiont. Comprehensive, multi-layered integrative studies remain rare, partly due to financial limitations, but also to the challenges presented by bioinformatic and statistical analyses.
Let us consider a hypothetical holo-omic study, where we have measured the host transcriptome of the liver in 100 cows (n = 100) and the meta-transcriptome of the rumen content in those same individuals (p = 20 000 host genes + an average of 3000 microbial genes × 200 microbial species = 620 000 features). Let us further assume that the experiment is set up to measure methane emission, and that half of the cows were given a methane-inhibiting feed additive (treatment) that indeed reduced emissions. This dataset would pose a massive challenge for data analysis, and not primarily because it would require considerable computational resources to assemble and annotate metagenome-assembled genomes (MAGs) and estimate expression (read mapping). The main challenge relates to the large number of features compared to samples. Naively, one might think that this dataset could be analysed using multivariate or machine learning-based prediction methods, where the predictive model could be queried for features or combinations of features that contributed significantly to the prediction: “IF gene G on MAG5 is up AND host gene H is down THEN low methane”. However, with this many features there will be an enormous number of feature combinations that could separate low- and high-emitting cows, and with only 100 examples (cows) to constrain them, we would never be able to discern real biological feature combinations from spurious ones (Fig. 2). This phenomenon is referred to as overfitting and is a consequence of the curse of dimensionality: the number of examples (cows) needed to identify the biologically meaningful features grows exponentially with the number of features.
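The problem can be illustrated with a minimal synthetic-data sketch (scaled down from 620 000 features purely for speed): when features vastly outnumber samples, a standard classifier can "explain" random labels perfectly on the training data, even though no real signal exists.

```python
# Minimal sketch with synthetic data: with far more features than samples,
# a classifier separates the training labels perfectly even though the
# features are pure noise, while cross-validation reveals chance-level accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_cows, n_features = 100, 20_000          # scaled down from 620 000 for speed
X = rng.normal(size=(n_cows, n_features)) # pure-noise "omics" features
y = np.repeat([0, 1], n_cows // 2)        # treatment vs. control labels

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))                     # ~1.0 (overfit)
print("5-fold CV accuracy:", cross_val_score(clf, X, y).mean())  # ~0.5 (chance)
```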
Methods that divide the aforementioned examples into training and test sets, such as cross-validation, would be able to tell us that we are overfitting, but will not be able to solve the problem. Even testing one feature at a time is problematic, since multiple hypothesis testing would severely limit the statistical power and thus only identify features with very large and consistent differences (i.e. large effect sizes) between the two treatments. Luckily, omics features are by no means independent and can be grouped into modules of co-abundant genes, proteins, or metabolites, for instance by correlation. This and other so-called dimensionality reduction approaches typically result in a few dozen distinct modules that can be used as our new features to reveal connections to methane emission and to hypothesise putative interactions between host and microbiome. A note of caution here is that methods for module finding that rely on computing a full distance matrix would require extreme amounts of memory. An approach used, for instance, by weighted gene co-expression network analysis (WGCNA, a method discussed later in this review) is to first split the data into “blocks” using k-means clustering, find modules in each block, and then combine similar modules at the end, as sketched below.
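The following sketch shows the blockwise idea on synthetic data; it is not the WGCNA implementation itself, just the strategy of splitting features into memory-sized blocks before correlation-based module detection.

```python
# Sketch of blockwise module detection (synthetic data): split features into
# blocks with k-means, then cluster correlated features within each block;
# similar modules could subsequently be merged across blocks.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))   # samples x features (e.g. log-scaled counts)

# 1) Split features into blocks so each per-block distance matrix fits in memory
blocks = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X.T)

modules = {}
for b in np.unique(blocks):
    idx = np.where(blocks == b)[0]
    corr = np.corrcoef(X[:, idx].T)                     # feature-feature correlation
    dist = squareform(1 - np.abs(corr), checks=False)   # correlation -> distance
    tree = linkage(dist, method="average")
    labels = fcluster(tree, t=0.7, criterion="distance")  # cut tree into modules
    for feat, lab in zip(idx, labels):
        modules[feat] = (b, lab)                        # module id = (block, cluster)
```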
Integrating several omics datasets in a multi-omics approach can help us home in on biologically meaningful patterns, if done carefully. Assume that we added metabolomics data to the aforementioned cow example: simply concatenating the transcriptomics and metabolomics tables would leave us with even more features (number of genes + number of metabolites). Instead, one could first identify genes and metabolites that are differentially abundant between “low” and “high” methane-emitting cows, and then select pathways that are enriched in both differential genes and differential metabolites. Such consensus integration methods use information about multiple types of molecules to constrain the number of possible biological interpretations.
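As a toy illustration of this consensus idea (not any specific published pipeline), the sketch below tests each layer for differential features on synthetic data and keeps only pathways hit in both layers; the pathway annotations and feature names are hypothetical placeholders.

```python
# Sketch with synthetic data: find features that differ between "low" and "high"
# emitters in each omics layer, then keep pathways supported by both layers.
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

def differential(table, groups, names, alpha=0.05):
    pvals = [mannwhitneyu(col[groups == 0], col[groups == 1]).pvalue for col in table.T]
    keep = multipletests(pvals, alpha=alpha, method="fdr_bh")[0]   # BH correction
    return {n for n, k in zip(names, keep) if k}

rng = np.random.default_rng(1)
groups = np.repeat([0, 1], 50)                                     # low vs high emitters
genes = rng.normal(size=(100, 200)); genes[groups == 1, :5] += 2   # 5 planted signals
mets  = rng.normal(size=(100, 50));  mets[groups == 1, :3] += 2    # 3 planted signals

diff_genes = differential(genes, groups, [f"g{i}" for i in range(200)])
diff_mets  = differential(mets,  groups, [f"m{i}" for i in range(50)])

# Hypothetical feature-to-pathway annotations
gene_to_pw = {"g0": "methanogenesis", "g1": "methanogenesis", "g2": "glycolysis"}
met_to_pw  = {"m0": "methanogenesis", "m1": "glycolysis"}
consensus = {gene_to_pw[g] for g in diff_genes if g in gene_to_pw} & \
            {met_to_pw[m] for m in diff_mets if m in met_to_pw}
print(consensus)   # pathways supported by both omics layers
```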
Although there are strong functional interdependencies between rumen microbes converting feed into fatty acids and the host animal metabolising fatty acids in the liver to produce energy, there are also clear physical boundaries separating these features, meaning that we should consider omic data origins in our holo-omic analysis design. In the case of pathway analysis, for example, one needs to consider that a pathway operates within the confines of a cell of a single organism. More generally, most integration methods are designed for a single species, and thus cannot be applied directly in a holo-omics setting. Any pattern discovered in omics data with the aim of describing host–microbiota interactions must include biomacromolecules originating from both sides of the holobiont boundary. This might be accomplished by first applying a standard (multi-)omics analysis method and then filtering the results afterwards, e.g. selecting modules containing genes from both the host and the microbiota. However, integrating the host–microbiota constraint as an integral part of the data analysis method could drastically reduce the search space, help deal with the curse of dimensionality and force results to include features from the host that might otherwise drown in the sea of microbial features. The methods described below are selected because we find them especially promising for solving challenges related specifically to holo-omic datasets.
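A minimal sketch of the post-hoc filtering option, assuming module memberships from any upstream clustering step and hypothetical feature names that encode their origin:

```python
# Sketch: keep only modules that contain features from BOTH host and microbiota.
# Module contents and the "host:"/"magN:" naming convention are hypothetical.
modules = {
    "M1": ["host:gene_FADS2", "mag5:gene_mcrA", "host:gene_HMGCS2"],
    "M2": ["mag12:gene_xynA", "mag3:gene_celB"],   # microbiome-only module
    "M3": ["host:gene_IL10"],                      # host-only module
}

def origin(feature):
    return "host" if feature.startswith("host:") else "microbiome"

cross_modules = {
    name: feats for name, feats in modules.items()
    if {origin(f) for f in feats} == {"host", "microbiome"}
}
print(cross_modules)   # only M1 spans the host-microbiome boundary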
Canonical correlation analysis (CCA) is a statistical technique akin to PCA in terms of finding a linear transformation of the original variables that consists of orthogonal vectors.54 The objective of CCA is to summarise the linear relationship between two sets of variables by identifying linear combinations – called canonical variables – that maximise correlations based on pairs of loading vectors. Although CCA is not primarily designed for dimensionality reduction, it plays a crucial role in comprehending multivariate relationships by revealing the directions in which two sets of variables are most interdependent. Several extensions of CCA further enhance its applicability: (i) multiset CCAs analyse maximal correlations across multiple sets of omics data; (ii) sparse CCAs identify a subset of variables most relevant to the canonical variables by introducing sparsity constraints; (iii) regularised CCAs incorporate regularisation which is particularly beneficial when dealing with high-dimensional data or when variables are not well-captured by linear transformations; and (iv) partial least squares CCAs which focus on predicting one set of variables using another, thus combining aspects of partial least squares regression with CCA.55 These extensions cater to diverse scenarios, offering flexibility to address specific challenges in multivariate analysis and canonical correlation.
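A minimal CCA sketch on synthetic data, assuming a shared latent signal between a host layer and a microbiome layer, shows how canonical variables and their loadings are obtained in practice:

```python
# CCA sketch: find paired canonical variables that maximise the correlation
# between a host omics layer and a microbiome layer (synthetic data).
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))                       # shared biological signal
host  = latent @ rng.normal(size=(2, 40)) + 0.5 * rng.normal(size=(100, 40))
micro = latent @ rng.normal(size=(2, 60)) + 0.5 * rng.normal(size=(100, 60))

cca = CCA(n_components=2).fit(host, micro)
host_c, micro_c = cca.transform(host, micro)             # canonical variables
for i in range(2):
    r = np.corrcoef(host_c[:, i], micro_c[:, i])[0, 1]
    print(f"canonical pair {i + 1}: correlation = {r:.2f}")

# cca.x_loadings_ / cca.y_loadings_ hold the loading vectors indicating which
# host and microbial features drive each canonical pair.
```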
Principal coordinates analysis (PCoA) is a linear transformation method similar to PCA that performs metric multidimensional scaling on a dissimilarity matrix to visualise sample relationships.56 Unlike PCA, PCoA is not limited to Euclidean measures and has been shown to be useful for comparing beta-diversity in microbial contexts. Non-metric multidimensional scaling (nMDS) is popular for amplicon/shotgun sequencing data, offering a rank-based approach that handles non-linear relationships and outliers effectively, albeit with potential distortions in global structures.57–60 Non-linear methods like t-distributed stochastic neighbour embedding (t-SNE) and uniform manifold approximation and projection (UMAP) belong to the second type of dimensionality reduction, known as neighbour graph algorithms.59–62 These methods emphasise preserving local structures, relying on graph layout algorithms to create probabilistic weighted graphs representing relationships between high-dimensional data points. UMAP and t-SNE differ primarily in their theoretical foundation for balancing the local and global structures.53 While t-SNE results can vary between runs due to its stochastic nature and sensitivity to initialisation, UMAP, although also stochastic, tends to be more stable across runs. UMAP excels at preserving the global structure of the final projection while still capturing local relationships, and is hence a better choice for prediction tasks.59,60 Nonetheless, it may struggle to distinguish closely nested clusters. It is crucial to note that all three non-linear methods are sensitive to initialisation, and it is recommended to employ the first two principal components from a linear approach as seeds for initialisation. Users should implement these exploratory methods with caution, exploring various hyperparameters and running multiple projections to assess stability. When choosing a non-linear dimensionality reduction method, careful consideration of data scale, characteristics, and specific research goals is essential.63
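The sketch below illustrates two of these ordinations on a synthetic abundance table. Note that dedicated microbiome toolkits implement PCoA via eigendecomposition; metric MDS on a precomputed Bray–Curtis matrix is used here only as a simple stand-in, and t-SNE is seeded with PCA as recommended above.

```python
# Ordination sketch on a synthetic abundance table: PCoA-style embedding via
# metric MDS on Bray-Curtis dissimilarities, and t-SNE with PCA initialisation.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS, TSNE

rng = np.random.default_rng(0)
counts = rng.poisson(lam=5, size=(60, 500)).astype(float)    # samples x taxa
rel_ab = counts / counts.sum(axis=1, keepdims=True)          # relative abundances

# PCoA-like ordination: metric MDS on a precomputed dissimilarity matrix
bray = squareform(pdist(rel_ab, metric="braycurtis"))
pcoa = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords_pcoa = pcoa.fit_transform(bray)

# t-SNE with PCA initialisation for more reproducible embeddings
tsne = TSNE(n_components=2, init="pca", perplexity=15, random_state=0)
coords_tsne = tsne.fit_transform(rel_ab)
```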
Non-negative matrix factorisation (NMF)64 is a method for dimensionality reduction that has been used both in several multi-omic studies and as a basis for additional tools for multi-omic data integration and analysis.65–69 NMF has the same foundation as PCA, essentially decomposing a large data matrix (D) consisting of feature values (p) across biological replicates (n) into a reduced set of (r) linear expressions. These expressions are represented by two matrices smaller than the original data; one with weights (W, p × r) and one with the reduced feature components (F, r × n) (Fig. 4A).
Fig. 4 Comparison of two methods for matrix factorisation; (A) and (B) non-negative matrix factorisation (NMF) and (C) and (D) multiset correlation and factor analysis (MCFA). Both methods reduce a full set of observed data d (columns of D) into linear expressions of reduced features f (columns of F) transformed by multiplication with weights (W). (A) and (C) In contrast to NMF, the MCFA method reduces the dataset into two spaces, either shared between all omics layers (S) or private to each one (P). (B) and (D) All features contribute to approximate the observed data for each shared omics layer, visualised in the same style as Fig. 4 in ref. 64.
In contrast to PCA, NMF requires the decomposition matrices to contain non-negative values only. This constraint causes the NMF-derived linear expressions to consist of addends only, thereby preventing cancellations between biological factors with opposing signs. NMF thus reflects the idea of assembling parts – analogous to the omic data layers – into a larger image representing the whole system. At the same time, the non-negativity constraint of NMF means that the compressed data must be seen as an approximation (≈) of the real data rather than an equality (=).70 The objective function for determining the decomposition matrices is then to minimise the difference between the real data (D) and the approximation (WF). This iterative approach may yield different solutions depending on the initial weight and reduced component matrices, potentially affecting the outcome of the analysis.71 Dimensionality reduction by NMF may hence be well in line with the analogy of assembling omic datasets to uncover interactions between layers of the complex system, although it results in an approximated model with a potentially large residual caused by the lossy factorisation.
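A minimal NMF sketch with scikit-learn on a synthetic non-negative table (note that scikit-learn orients the matrices as samples × features, whereas the notation above uses features × samples):

```python
# NMF sketch: a non-negative data table D is approximated as W @ F with r
# reduced components; the reconstruction is an approximation (D ~ WF), not D = WF.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
D = rng.poisson(lam=3, size=(100, 620)).astype(float)   # e.g. normalised counts

model = NMF(n_components=10, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(D)          # sample weights over reduced components
F = model.components_               # reduced components over features

residual = np.linalg.norm(D - W @ F) / np.linalg.norm(D)
print(f"relative reconstruction error: {residual:.2f}")
```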
Another approach to holo-omic dataset integration based on matrix factorisation is multiset correlation and factor analysis (MCFA)72 (Fig. 3 and 4C). While also seeking to compress observed data (D) into matrices for weights (W) and reduced components (F), MCFA effectively divides the model into two parts. One set of decomposition matrices fits the so-called shared space (S), consisting of reduced features with implied importance across all the included omics layers. This shared space is determined through an extension of CCA called probabilistic CCA (pCCA), and it serves the same purpose as the general decomposition seen in NMF. Additional sets of decomposition matrices are then fitted for each individual omics layer through factor analysis, based on the residual between the observed data (D) and the modelled shared space. These “private” aspects of the model reflect contributions from factors that are only perceived as important for observations in specific omics layers. The full model then combines the shared and private spaces to approximate the real data, determining the weight and feature matrices through an expectation maximisation algorithm, with the remainder (ψ) being quantified as a third addend to complete the expression.
By fitting the observed data to shared and private reduced features separately, the MCFA method may help distinguish between components with implied importance across all levels of the holobiont and those that only appear relevant for a particular omics layer. Additionally, introducing a private model layer for each omic may leave a smaller residual than if the model only covered components relevant for all included data layers. At the time of writing, MCFA has not been applied in a peer-reviewed study since its publication in August 2023, so its versatility for holo-omic data integration has yet to be demonstrated.
WGCNA is a popular framework for investigating associations between biological features within a single omics layer.75 It calculates an adjacency matrix containing transformed, pairwise correlations between biological features such as genes, proteins, and metabolites. The adjacencies are transformed in order to obtain a scale-free network, in which features can be related to continuous and categorical external data such as phenotypic traits or treatment groups. On the basis of these adjacencies, the topological overlap measure can be used together with hierarchical clustering to obtain a set of clusters in which each biological feature becomes part of only one cluster. In WGCNA terminology, these clusters are referred to as “modules” and are represented by their first principal component. This linear combination of biological features is referred to as an eigengene and is intended to capture the most important variation of the module with limited noise. Since these modules are defined without using information about treatments or traits, the method can be characterised as unsupervised.
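The reference implementation is the WGCNA R package; the simplified Python sketch below only illustrates the core idea on synthetic data with one planted module: soft-thresholded correlations, hierarchical clustering into modules, and module eigengenes related to an external trait.

```python
# Simplified WGCNA-style sketch (not the WGCNA R package): soft-thresholded
# correlation network, hierarchical module detection, eigengene-trait correlation.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
trait = rng.normal(size=100)                               # e.g. methane yield
base = rng.normal(size=(100, 380))                         # unrelated features
block = trait[:, None] + 0.5 * rng.normal(size=(100, 20))  # 20 co-abundant features
expr = np.hstack([base, block])                            # samples x features

beta = 6                                                   # soft-thresholding power
adjacency = np.abs(np.corrcoef(expr.T)) ** beta            # scale-free-like network
dist = squareform(1 - adjacency, checks=False)
labels = fcluster(linkage(dist, "average"), t=0.95, criterion="distance")

for module in np.unique(labels):
    members = expr[:, labels == module]
    if members.shape[1] < 5:
        continue                                           # skip tiny modules
    eigengene = PCA(n_components=1).fit_transform(members)[:, 0]
    r = np.corrcoef(eigengene, trait)[0, 1]                # sign of eigengene is arbitrary
    print(f"module {module}: {members.shape[1]} features, trait correlation {r:.2f}")
```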
WGCNA can be extended to holo-omic data76 by relating the modules across the host–microbiome boundary. WGCNA has been applied for both clustering and dimensionality reduction in several multi- and holo-omic studies related to both plant77 and animal biology.76,78–80 One study concerning the gut microbiome in patients with insulin sensitivity or resistance79 applied a range of node selection and dimensionality reduction methods to the data, and used WGCNA to find clusters of hydrophilic and lipid metabolites. These were later connected to other omics layers to identify clusters associated with gut microbiome metabolism that differed between the groups of patients.
Alternative clustering methods can also be employed for dimensionality reduction. A state-of-the-art example is the Leiden algorithm,81 an optimisation-based form of clustering. The algorithm was used in a study of HIV patients that investigated patient health in relation to their microbiomes: specifically, the Leiden algorithm was used to detect clusters of microbiome-derived metabolites before integrating these features with other omics layers.82 Similarly, a study of SARS-CoV-2 infection used the Leiden algorithm to detect clusters of metabolites.80
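A sketch of Leiden clustering applied to a feature co-abundance network is shown below, assuming the python-igraph and leidenalg packages (the cited studies may have used different implementations); the data and correlation threshold are arbitrary.

```python
# Sketch: Leiden clustering of a co-abundance network built from synthetic
# metabolite profiles; requires the python-igraph and leidenalg packages.
import numpy as np
import igraph as ig
import leidenalg

rng = np.random.default_rng(0)
profiles = rng.normal(size=(60, 200))                 # samples x metabolites
corr = np.corrcoef(profiles.T)

# Connect features whose absolute correlation exceeds an (arbitrary) threshold
edges, weights = [], []
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[0]):
        if abs(corr[i, j]) > 0.3:
            edges.append((i, j))
            weights.append(abs(corr[i, j]))

g = ig.Graph(n=corr.shape[0], edges=edges)
g.es["weight"] = weights
partition = leidenalg.find_partition(g, leidenalg.ModularityVertexPartition,
                                     weights="weight", seed=0)
print("number of clusters:", len(partition))
```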
Transkingdom network analysis (TkNA)83 for holo- and multi-omics is a network-based method that detects biological features that differentiate treatment groups. TkNA is designed to handle a binary testing condition, such as “disease” and “control”. The method consists of a comprehensive pipeline containing all the functions needed to transform normalised data into a network that can be readily visualised. TkNA creates a co-variation network and calculates node statistics such as node degree and bipartite betweenness centrality (BiBC). This approach emphasises that hub nodes with high BiBC and degree represent potential modulators of the biological network. Additionally, TkNA interfaces with the Infomap84 and Louvain85 network clustering algorithms, which can further aid in the interpretation of a biological network.
The size and complexity of networks created from holo-omic datasets make them hard to interpret; it is hence necessary to find ways to categorise and structure the represented data. Clustering nodes, and thus reducing the number of visual features to consider, can help organise the network. This is exemplified by the aforementioned SARS-CoV-2 study, where WGCNA was used to recognise clusters across omics layers. The cross-omic clusters correlating with disease severity revealed a relationship between host serum metabolites and microorganisms.80
In gene set enrichment analysis (GSEA), a gene set usually represents a metabolic pathway that performs a specific biological function. By testing whether genes from a specific pathway are enriched in a network cluster or module, we can argue that this pathway is captured by the module, and thereby draw further conclusions about its activity by interpreting the module's omics profile and its association with other phenotypic metadata. GSEA can be applied to clusters defined using any clustering algorithm. An example is a study on Atlantic salmon76 in which gene set enrichment analysis was used to show that certain host RNA genes responded to long-chain fatty acids in the feed. A similar method86 for improving interpretability is network enrichment, in which functional information and network connectivity are integrated. Instead of testing for a significant difference between treatment groups like GSEA, network enrichment quantifies the differential representation among neighbours in the gene network.87
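A minimal over-representation-style enrichment test of a gene set within a module can be expressed as a hypergeometric test; the gene identifiers below are hypothetical placeholders.

```python
# Sketch of an over-representation test for a gene set within a module,
# using a hypergeometric test (toy gene identifiers).
from scipy.stats import hypergeom

universe = {f"gene{i}" for i in range(2000)}                 # all measured genes
pathway  = {f"gene{i}" for i in range(40)}                   # a gene set / pathway
module   = {f"gene{i}" for i in range(20)} | {f"gene{i}" for i in range(1000, 1030)}

overlap = len(pathway & module)
# P(overlap >= observed) when drawing len(module) genes at random from the universe
pval = hypergeom.sf(overlap - 1, len(universe), len(pathway), len(module))
print(f"{overlap} pathway genes in module, enrichment p = {pval:.2e}")
```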
A network can be interpreted through statistical concepts that describe crucial properties of the nodes and how they are connected. Degree is simply the number of neighbours of a node; expressed relative to the node with the highest number of neighbours, it becomes degree centrality. Node betweenness describes how many of the pairwise shortest paths in the network pass through a specific node. If this betweenness measurement is high, the node represents a bottleneck and may have a regulatory effect.83 The clustering coefficient of a node describes the number of edges between its neighbours relative to the possible number of edges between them. Coreness considers the neighbourhood of a node, describing whether the node is part of a “core” of nodes that are all interconnected with at least a certain degree (k); a network can hence be characterised by the maximum coreness over all nodes. Eigenvector centrality is another network statistic computed for each node: the eigenvector corresponding to the maximum eigenvalue of the adjacency matrix is computed and normalised, and its entries become the eigenvector centralities. Generally, nodes with high eigenvector centrality are essential and interact closely with their respective neighbours.74 In a study of periodontal disease and response to different treatments, eigenvector centrality was used specifically to find nodes in the network that were connected to other highly connected nodes.88 This revealed microbial taxa that could be more closely associated with the patients' health status. The same study also examined network transitivity – the ratio of closed triplets (triangles) to all connected triplets – for the networks over different patients and disease states. This statistic is high in the presence of clusters, and the more severe disease cases in the study were associated with lower transitivity; a higher interdependence (i.e. transitivity) between microbes was therefore indicated to be beneficial for the patient. The severe cases were also more often associated with networks with a high diameter – the length of the shortest path between the two most distant nodes – which is expected with low transitivity.
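All of these node and network statistics are readily computed with standard graph libraries; the sketch below uses networkx on a random graph standing in for a holo-omic co-abundance network.

```python
# Sketch of the network statistics described above, computed with networkx
# on a small random graph used as a placeholder network.
import networkx as nx

G = nx.erdos_renyi_graph(n=50, p=0.1, seed=0)        # placeholder network

degree_centrality = nx.degree_centrality(G)           # degree relative to maximum possible
betweenness = nx.betweenness_centrality(G)            # bottleneck / regulatory candidates
clustering = nx.clustering(G)                         # per-node clustering coefficient
coreness = nx.core_number(G)                          # k-core membership per node
eigencentrality = nx.eigenvector_centrality(G)        # connection to well-connected nodes

print("max coreness:", max(coreness.values()))
print("transitivity:", nx.transitivity(G))            # ratio of closed to connected triplets
if nx.is_connected(G):
    print("diameter:", nx.diameter(G))                # longest shortest path in the network
```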
mixOmics90 is a toolkit that offers both unsupervised and supervised statistical approaches for multi-table datasets, ranging from single-omic analysis to complex multi-omics. The supervised method for multi-omics, titled Data Integration Analysis for Biomarker discovery using Latent cOmponents (DIABLO),91 is based on partial least squares regression/projection to latent structures92 discriminant analysis (PLS-DA)93 and sparse generalised canonical correlation analysis (sGCCA),54 an extension of the CCA method. The sparse version of DIABLO uses the lasso94 to select the features from each layer that best discriminate between groups of interest. Since DIABLO does not assume any particular distributions of the input data,91 it is applicable to holo-omic datasets, as long as each layer is normalised in a way that is appropriate for that data type. The limiting factor of this approach is that DIABLO is a supervised method aimed at classifying data into pre-established groups of interest, which makes it less useful for basic, explorative holo-omic studies. Examples where this tool has already been used include a study of the relationships between gut microbiota, dietary fatty acids, and liver gene expression in mice;95 and a study of the effects of cyanobacterial blooms on the microbiome and metabolome of medaka fish.96
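DIABLO itself is part of the mixOmics R package; as a rough illustration of the underlying PLS-DA idea only, the Python sketch below runs PLS regression against one-hot encoded class labels on synthetic data.

```python
# Not DIABLO or mixOmics - a minimal stand-in for the PLS-DA idea:
# PLS regression against one-hot encoded class labels on synthetic data.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 300))                      # one omics layer (samples x features)
y = np.repeat([0, 1], 40)                           # two groups of interest
X[y == 1, :10] += 1.0                               # planted group difference
Y = np.eye(2)[y]                                    # one-hot encoded classes

plsda = PLSRegression(n_components=2).fit(X, Y)
scores = plsda.x_scores_                            # latent components, e.g. for plotting
predicted = plsda.predict(X).argmax(axis=1)         # assign each sample to the top class
print("training accuracy:", (predicted == y).mean())
```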
For studies that do not involve a predefined grouping variable, mixOmics is compatible with mixKernel97 for multi-omics integration. This explorative, unsupervised approach is based on forming a kernel – a symmetric, positive semi-definite function that provides pairwise similarities between samples – to represent each layer of data.97 These can be combined into a meta-kernel in one of two ways: (i) a consensus kernel, or (ii) a sparse kernel that preserves the topology of the original data. The meta-kernel can then be used in downstream analyses, for example kernel PCA (KPCA)98 for visualisation of the different layers. Since mixKernel is suited for heterogeneous data, it is also applicable to holo-omics. So far, this method has not been commonly utilised in a host–microbiome context, but it successfully complemented simpler, single-table statistics when selecting plant-beneficial bacterial strains for rice cultivation based on plant growth-related measurements.99
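The consensus-kernel idea can be sketched as follows (this is not the mixKernel package itself): one kernel per omics layer, a simple average as the meta-kernel, and kernel PCA on the combined kernel.

```python
# Sketch of a consensus meta-kernel (not mixKernel): build one kernel per omics
# layer, average them, and run kernel PCA on the combined, precomputed kernel.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
host_layer  = rng.normal(size=(60, 500))            # samples x host features
micro_layer = rng.normal(size=(60, 2000))           # samples x microbial features

K_host  = rbf_kernel(host_layer)                    # pairwise sample similarities
K_micro = rbf_kernel(micro_layer)
K_meta  = (K_host + K_micro) / 2                    # simple consensus meta-kernel

kpca = KernelPCA(n_components=2, kernel="precomputed")
coords = kpca.fit_transform(K_meta)                 # 2D embedding of the samples
```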
Another explorative method is mCIA100 – a multi-table version of co-inertia analysis (CIA or COIA)101 – which has been tested for selecting rice growth-promoting bacteria.99 CIA resembles sparse PLS (sPLS) in that it also seeks to maximise the covariance between eigenvectors.100 mCIA has been extended to sparse mCIA (smCIA), which adds feature selection and thereby improves the interpretability of the results.102 There is also a further extension, structured sparse mCIA (ssmCIA), which enables the incorporation of structural information about variables, such as regulatory networks for genes.100 However, this is less relevant for holo-omic analyses, as such pre-existing information is seldom available.
Compositional omics model-based integration or COMBI103 is another explorative, unsupervised multi-table method. It is particularly appropriate for host–microbiome analyses since it has been designed to account for compositionality, a feature common to many microbiome measurements such as 16S rRNA gene amplicon data and shotgun metagenomic data.104 Specifically, compositional data is handled through using the centred log-ratio transform as a link function in the models, while the integrative part of the approach is based on inferring latent variables.103 This method also offers visualisation of the results as a multiplot showing the features with the largest loadings.
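COMBI builds the centred log-ratio (CLR) transform into its model as a link function; the transform itself is simple, as the toy sketch below shows on a small count table (a pseudocount is added here to avoid taking the log of zero).

```python
# Sketch of a centred log-ratio (CLR) transform for compositional count data;
# COMBI incorporates the CLR within its own model, this only illustrates the transform.
import numpy as np

def clr(counts, pseudocount=0.5):
    """CLR-transform a samples x taxa count matrix."""
    x = counts + pseudocount
    log_x = np.log(x)
    # subtract the per-sample mean of the logs, i.e. the log of the geometric mean
    return log_x - log_x.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
counts = rng.poisson(lam=2, size=(5, 10))
print(clr(counts).round(2))        # each row now sums to ~0 and is scale-invariant
```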
Finally, latent Dirichlet allocation (LDA) is a form of unsupervised dimensionality reduction105 (Fig. 3). It uses a specific terminology because it was originally invented for text mining. In a corpus – a set of text documents that represent a spectrum of topics – it allocates each word occurrence to one of a predetermined number of topics. Each topic is a set of words that, as a whole, revolve around a semantic context. Although the topics are coherent and represent an underlying theme, the title of each topic must be defined manually by interpreting the words listed in each topic. As a text-mining tool, LDA does not immediately lend itself to biological data. But consider substituting an omics layer for the corpus: documents become biological samples, and genes or compounds become the words. The model will then be able to capture latent topics defined by biological features that tend to occur together in the same samples (co-abundance), forming topics that represent metabolic functions in the samples. This text–biology analogy means that LDA can be applied in biological studies.106
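The analogy translates directly into code: a minimal sketch with scikit-learn, in which a synthetic samples × features count table plays the role of the document–word matrix and "topics" become groups of co-abundant features.

```python
# LDA sketch on omics-style count data: samples act as documents, features
# (genes, taxa or compounds) act as words, topics group co-abundant features.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
counts = rng.poisson(lam=2, size=(100, 300))           # samples x features (counts)

lda = LatentDirichletAllocation(n_components=8, random_state=0)
sample_topics = lda.fit_transform(counts)              # per-sample topic proportions
feature_topics = lda.components_                       # per-topic feature weights

top = feature_topics[0].argsort()[::-1][:10]           # top features of topic 0
print("topic 0 top features:", top)
```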
As this is a new, fast-moving field, there is still no consensus on the best way to do science using holo-omics. We hope that this review can generate discussion and new ideas on how to approach the further development of holo-omic methodologies, and we are confident that gold-standard methodologies will soon be established.