Systems biology approach to elucidation of contaminant biodegradation in complex samples – integration of high-resolution analytical and molecular tools †

We present here a data-driven systems biology framework for the rational design of biotechnological solutions for contaminated environments with the aim of understanding the interactions and mechanisms underpinning the role of microbial communities in the biodegradation of contaminated soils. We have considered a multi-omics approach that employs novel in silico tools to combine high-throughput sequencing data (16S rRNA amplicons) with chemical data including high-resolution analytical data generated by comprehensive two-dimensional gas chromatography (GC (cid:1) GC). To assess this approach, we have considered a matching dataset with both microbiological and chemical signatures available for samples from two former manufactured gas plant sites. On this dataset, we applied the numerical procedures informed by ecological principles (predominantly diversity measures) as well as recently published statistical approaches that give discriminatory features and their correlations by maximizing the covariances between multiple datasets on the same sample space. In particular, we have utilized sparse projection to latent discriminant analysis and its derivative to multiple datasets, an N-integration algorithm called DIABLO. Our results indicate microbial community structure dependent on the contaminated environment and unravel promising interactions of some of the microbial species with biodegradation potential. To the best of our knowledge, this is the ﬁ rst study that incorporates with the microbiome an unprecedented high-level distribution of hydrocarbons obtained through GC (cid:1) GC.


Introduction
Industrialisation has le behind a legacy of pollution that presents a risk to human health and the environment. Disposal to landll has long been the preferred approach for disposal of contaminated soils, however rising landll tax costs drive the development of novel, cheaper and safer remediation technologies.
Newly developed sustainable and safe remediation approaches include ex situ (such as biopiles) and in situ (monitored natural attenuation (MNA) and enhanced natural attenuation (ENA)) bioremediation. MNA consists of the monitoring and testing of the progress of natural processes that degrade contaminants in situ, and biopiles and ENA consist of stimulating in situ microbial processes by either introducing new microorganisms (bioaugmentation) or modifying the environmental conditions to stimulate the growth of degrading organisms (biostimulation). 1,2 They are advantageous practices because they produce minimal waste and disturbance to the site but also minimise contact between remediation engineers and the contaminated site.
The microbial organisms and catabolic genes involved in the biodegradation of organic contaminants have been well characterised for a large variety of contaminants including polycyclic aromatic hydrocarbons (PAHs). PAHs are recognised lipophilic legacy organic pollutants present in crude oil and are produced during the combustion of coal and organic matterthey are therefore ubiquitous soil contaminants. They are usually present in their thousands in complex environmental samples but only a few, such as naphthalene, phenanthrene or benzo(a)pyrene, are recognized as carcinogenic or mutagenic and are currently on priority pollutant lists. Therefore, the biodegradation of PAHs by microbes has up until now been almost exclusively studied in vitro in microcosms and is usually demonstrated for a single compound or a very limited number of compounds. This is, however, an overly simplistic view of what happens in situ where multiple contaminants are present and their fate likely to be interdependent.
The design of smart and efficient bioremediation solutions in soil requires extracting from the exhaustive knowledge on the degradation of complex samples the information that will enable enhancement of the degradation of listed contaminants. Eventually, the aim is also to optimise analysis carried out during site investigation to provide remediation practitioners with the necessary information to design remediation approaches. Before this is achievable, however, understanding what the relevant factors governing biodegradation of PAHs in soil are is crucial. These factors will include: soil characteristics, physicochemical characteristics (e.g. pH and organic matter content), microbial ecology and the nature of the contaminations (PAH distribution and concentration, but also the presence of organic and inorganic co-contaminants).
The recent rapid technological advances in nucleic acid sequencing have enabled the high-resolution characterisation of microbial communities in PAH contaminated soils. Specically, metataxogenomic and metagenomic approaches have demonstrated that contamination induces a reduction in species richness and enriches the population with species that are adapted to hydrocarbon degradation. 3 Consequently, initial microbial ecology could also be linked to the potential for PAH degradation, 4 demonstrating that microbial ecology carries meaningful information on the potential for bioremediation in soil. However, without high resolution information on the contamination prole, notably the presence of other contaminants, this information might not be sufficient for the design of an efficient site-specic bioremediation. Comprehensive two-dimensional gas chromatography coupled with mass spectrometry (GC Â GC-MS) has allowed the near comprehensive characterisation of semi-volatile organic carbons (SVOCs) in tars and contaminated soils from former manufactured gasworks (FMGs). [5][6][7] The coupling of two columns allows for a two-dimensional separation across a retention plane rather than along a retention line. The peak capacity of GC Â GC is several orders of magnitude higher than traditional techniques and thousands of compounds can be resolved; coupled with the resolution power of mass spectrometry, it generates ultra-resolution chemical signatures of environmental samples. It has been used for source apportionment, 8 monitoring of bioremediation across SVOC classes 6 and estimation of ecotoxicity. 9 Integration of metataxogenomics and GC Â GC analysis has the potential to unveil information on the interactions between complex mixtures of environmental contaminants and microbial ecology never accessed before. Integrative multivariate statistical approaches for multi-omics data are being developed in biomedical applications. 10 For instance, the DIABLO 10 method builds on the Generalised Canonical Correlation Analysis, which integrates multiple datasets by nding principal components (latent variables) that maximize the covariance of scores between different datasets and the categorical outcome of interest. The resulting loading vectors are then constrained to give discriminants that correlate between these datasets.
Here, through the comparison of soils from two different FMGs, we present an example of statistical integration of chemical and molecular data in contaminated soils. Chemical and DNA extracts were analysed and processed through robust methods and pipelines 1-4 previously developed in our laboratories and statistical analysis was rst carried out independently. Multivariate analysis enabled forensic characterisation of the site by exploring intra-and inter-site variations in distributions of PAHs and other compounds. 2 16S rRNA sequencing data were explored for alpha and beta diversity analyses, 5 inferred metabolic pathway analysis 6 and differential abundance analysis of both species and metabolic pathways between sites. 5 Then, the N-integration algorithm DIA-BLO 7 was used to classify and discriminate features that correlate between the microbiome and chemical metadata including, for the rst time, high-resolution distribution of SVOCs obtained by GC Â GC-MS.

Samples
Samples from two different former manufactured gas plants were provided by collaborators. Very little information was provided about the samples so no spatial resolution has been attempted in this work. The samples were stored in plastic tubes at 4 C. Eighteen samples came from a site in the United Kingdom (COV site) and nine samples from a site in Switzerland (CH site). The COV soil samples, generally dark, compact and sticky in nature, were each sieved once with a 10 mm mesh. The CH samples, brown in colour and drier in nature, were each sieved through 10, 2.36 and 1.70 mm meshes. For each individual sample, the moisture content in percentage was measured by placing a subsample in an oven at 105 C for 24 h, and loss on ignition (LOI), also in percentage, was measured by placing a subsample in a furnace at 550 C for 2 h. No further soil characterisation was carried out at this stage.

Semivolatile organic compound analysis
Approximately 0.25 g of each soil sample was extracted via pressurised liquid extraction (PLE) with on-line silica gel clean-up. 5 The rst fraction (hexane) and second fraction (hexane : toluene) (v/v) (8 : 2) were collected together and concentrated to 1.15 mL. Quantication of 14 PAHs along with four surrogates was carried out using GC-MS. It was carried out in TIC or SIM mode depending on the concentrations. We rst validated the PLE method by extracting two of the COV samples six times and quantifying the 14 PAHs in duplicates. The relative standard deviation for the quantication of an individual PAH varied between 1% and 16.2%, with only one value over 10% demonstrating that the extraction was repeatable. Consequently, all other samples were only extracted once.
To optimise chromatographic resolution, we also employed a GC Â GC method we had previously developed for the exhaustive characterisation of environmental coal tar samples. 5 The method ensured optimal separation of PAH isomers but also separated alkanes and alkylated benzenes. Comprehensive semivolatile signatures of the extracts 5 were obtained using a LECO (St. Joseph, Michigan) time of ight mass spectrometer, model Pegasus 4D, connected to an Agilent 7890A gas chromatograph equipped with a LECO thermal modulator. The column set-up employed is known as reversed phase, with a rst dimension capillary column (TR50-MS; 30 m Â 0.25 mm i.d. Â 0.25 mm lm thickness; Thermo) that was more polar than the second capillary column (Rxi-5Sil; 2 m Â 0.25 mm i.d. Â 0.25 mm lm thickness; Thames Restek). In this set-up, PAHs have a short retention time in the second dimension while alkanes are retained for longer. Alkylbenzenes, alkynes and alkenes, in order of polarity, elute in the retention space between PAHs and alkanes. 5 To be input into multivariate analysis, the GC Â GC data must rst be aligned between samples. Alignment can be carried out either by aligning chromatograms or peak tables (the output of GC Â GC data processing). Here, we employed a combination of peak picking using the LECO ChromaTOF soware and peak alignment using the R code R2DGC. 11 We optimised the data processing and R2DGC parameters but decided not to carry out any manual tidying up of the peak tables, which were le. Indeed, the peak tables contained over 500 compounds each, and a manual check of each peak for truly exhaustive analysis was not feasible. Two instrumental duplicates for each sample were analysed. For one COV soil sample, ve replicate PL extracts were analysed in duplicate. Additional instrumental replicates were included for some samples because the second-dimension capillary column needed replacing during the study, shiing both retentions; hence samples were re-analysed aer the change in columns. All peak tables were aligned to the CH peak table that contained the most peaks (top le in Fig. 1). The alignment of 68 peak tables was carried out twice with missing value limits equal to 0.1 i.e. a compound had to appear in at least 10% of the peak tables to be included in the alignment table, and 1 where a compound had to be present in all peak tables.
All methods are described in detail in the ESI. †

Transition metal analysis
The concentrations of lead, iron, cadmium, chromium, zinc, copper, and nickel were determined using inductively coupled plasma optical emission spectrometry (ICP-OES) and the methods are discussed in the ESI. †

Genomic DNA analysis
Genomic DNA was extracted twice from soil samples (0.25 g fresh weight); once for quantitative PCR (qPCR) for the alkB 12 and PAH RHD GN and PAH RHD GP genes, and a second time for 16S (V3 and V4 regions) metataxogenomic library preparation and sequencing. The detailed methods and bioinformatics workow are presented in the ESI. †

Statistical analysis
All statistical analyses were performed in R.
Hierarchical clustering analysis (HCA) of the GC Â GC alignment table was performed by calculating Manhattan distance between the samples each comprising 961 metabolites. Prior to analysis, zero values were replaced by one third of the smallest value in the table and each peak area was normalised to the calculated total peak areas for a given sample. Aerwards, we performed agglomerative clustering using complete linkages utilising R's hclust() function. For visualisation, we used R's dendextend package. 13 We then used the color_branches() function from the same package to cluster both the terminal leaves of the dendrogram and the edges leading to the samples. The vegan package was used for alpha and beta diversity analyses. For alpha diversity measures we have used: rareed richnessthe estimated number of species aer rarefying the abundance table to minimum library size; Shannon entropya commonly used index to measure balance within a community; Simpson indexa measure of dominance that weighs towards the abundance of most common OTU and is less sensitive to rare OTUs, Pilou evenesscompares the actual diversity values to the maximum possible diversity value, is constrained between 0 and 1.0, and the more variation in abundance between different OTUs in the community, the lower its value; and Fisher's alphaa parametric index of diversity that assumes the abundance of OTUs following the log series distribution. Ordination of the OTU table in reduced space (beta diversity) was done using Principal Coordinate Analysis (PCoA) plots of OTUs using three different distance measures that were made using Vegan's cmdscale() function: (1) Bray-Curtis is a distance metric that considers only OTU abundance counts, (2) unweighted UniFrac is a phylogenetic distance metric that calculates the distance between samples by taking the proportion of the sum of unshared branch lengths in the sum of all of the branch lengths of the phylogenetic tree for the OTUs observed in two samples, and without taking into account their abundances and, (3) weighted UniFrac is a phylogenetic distance metric combining phylogenetic distance with relative abundances. This places emphasis on dominant OTUs or taxa. UniFrac distances were calculated using the phyloseq package. 14 Analysis of the variance for explanatory variables (or sources of variation) was performed using Vegan's adonis() against distance matrices (Bray-Curtis/ unweighted UniFrac/weighted UniFrac). This function, referred to as PERMA-NOVA, ts linear models to distance matrices and uses a permutation test with pseudo-F ratios. To give an account of environmental ltering (phylogenetic overdispersion versus clustering), phylogenetic distances within each sample were further characterised by calculating the nearest taxa index (NTI) and net relatedness index (NRI). This analysis aimed to determine whether the community structure was stochastic (overdispersion and driven by competition among taxa) or deterministic (clustering and driven by strong environmental pressure). The NTI was calculated using mntd() and ses.mntd(), and the mean phylogenetic diversity (MPD) and NRI were calculated using mpd() and ses.mpd() functions from the picante package. 15 NTI and NRI represent the negatives of the output from ses.mntd() and ses.mpd(), respectively. Additionally, they quantify the number of standard deviations that separate the observed values from the mean of the null distribution (999 randomisation using null.model-'richness' in the ses.mntd() and ses.mpd() functions and only considering taxa as either present or absent regardless of their relative abundance). Based upon the recommendations, 16 only the top 1000 most abundant OTUs were used for the calculations.
Discriminant analyses between the two sites were considered using microbiome data alone, and then together with the meta data (Table 1). For the former case, we used Sparse Projection to Latent Structure-Discriminant Analysis (sPLS-DA) with the R's mixOmics package. 10 The procedure constructs articial latent components of the predicted dataset (OTUs table denoted X(N Â P)) and the response variable (denoted Y with categorical information of samples, e.g. CH and COV) by factorizing these matrices into scores and loading vectors in a new space such that the covariance between the scores of these two matrices cov(X h a h ,Y h b h ) in this space is maximized under two constraints: ka h k 2 ¼ 1; and ka h k 1 # l, where a h and b h are the corresponding loading vectors for X and Y, and h represents the number of components (akin to PCA analysis). To integrate meta data further, we utilised DIABLO from R's mixOmics package. We have combined represents the microbiome data, and X (2) represents meta data (moisture content, LOI, 14 PAHs concentrations and GC Â GC signatures whether considered or not) ( Table 1). An additional matrix X (3) required for the algorithm to enable the discriminant analysis is a dummy matrix of the classes the samples belong to (whether COV or CH, and equivalent to matrix Y in the sPLS-DA case). The algorithm then constructs articial latent components of the datasets by factorizing the datasets into scores and loading vectors in new space such that the covariance between the scores of these matrices in this space is maximized, i.e., for q ¼ 1, 2, ., Q, DIABLO solves for each where l (q) is the penalization parameter, a (q) h is the loading vector on component h associated to the (deated) matrix X (q) h of the data set X (q) , and C ¼ {c q,j } q,j is the design matrix, where Q ¼ 3. C is a Q Â Q matrix that species whether datasets should be correlated and includes values between zero (datasets are not connected) and one (datasets are fully connected). The rst constraint ka (q) h k 2 ¼ 1 (similar to sPLS-DA) ensures the loading vector has unit magnitude (requirement of the procedure) and the second constraint ka (q) h k 1 # l (q) (also called l 1 penalty) ensures that for the features that do not vary between the categories, the corresponding loading vector coefficients go to zero. This is done using the sparsity control parameter l (q) in the above equation, and adjusting it enforces shrinkage of the loading vector coefficients. According to the recommendations given in the mixOmics package (http://www.mixomics.org), before applying the sPLS-DA and DIABLO procedures, we pre-lter 1% of the lowest OTUs and then perform TSS +  CLR (Total Sum Scaling followed by Centralised Log Ratio) normalisation. For the design matrix in DIABLO, mixOmics suggests that a full weighted design where c q,j ¼ 0.1 between data matrices and 1 for the outcome leads to a trade-off between maximizing correlation between datasets and maximizing the discrimination of the outcome, and therefore we used this, i.e., C ¼ To predict the number of latent components (associated loading vectors) and the number of discriminants, for sPLS-DA, we used the perf.plsda() and tune.splsda() functions, whereas for DIABLO, block.splsda() and tune.block.splsda() functions were used, respectively. In both cases, we ne-tuned the model using leave-one-out cross-validation by splitting the data into training and testing sets and then nding the classication error rates employing two metrics, overall error rates and balanced error rates (BER), between the predicted latent variables with the centroid of the class labels (categories considered in this study) using the max.dist (which gave the minimal classication rate for the scenarios considered in this study). BER accounts for differences in the number of samples between different categories. Other than TSS + CLR normalisation for the abundance table, log 10 normalisation for qPCR data was used, and GC Â GC signatures were normalised using pareto scaling. When displaying boxplots, pair-wise ANOVA or Kruskal-Wallis were performed taking two categories at a time, and where signicant (p # 0.05), they were joined together by a line and signicance was plotted on top (*: 0.01 # p < 0.05; **: 0.05 # p < 0.001; ***: p # 0.001).

Comprehensive SVOCs characterisations of contamination
Comprehensive two-dimensional analysis coupled with time-of-ight mass spectrometry analysis was carried out on the soil samples. GC Â GC-TOFMS enables an exhaustive characterisation of the distribution of semi-volatile organic compounds (SVOCs) in the samples as a novel "omics" dimension in the study of bioremediation of complex contaminated soils. Fig. 1 shows the GC Â GC chromatograms for representative samples of CH and COV. Visually, COV samples appeared more heavily contaminated with PAHs than the CH samples, which was conrmed using the one dimensional GC quantication (Fig. 2). The proportion of substituted PAHs was also higher in the COV samples. By comparison, the CH samples had higher proportions of alkanes, alkenes, alkynes and alkylbenzenes. This could possibly be explained by a difference in the processes involved in the coal gasication as we demonstrated previously in a forensic study of coal tar samples from FMGPs. 8 Higher proportions of PAH parents and lower proportions of alkanes were associated with high temperature gasication, while high proportions of petroleum-like hydrocarbons can be associated with carburetted water gas plants.
While the potential of GC Â GC for comprehensive analysis of SVOCs in contaminated samples is evident in terms of the resolution of the analytical method, the availability of fully comprehensive, automated and reproducible data processing and alignment for GC Â GC data is arguably the biggest bottleneck to its use in omics-like studies. The accuracy of the chosen processing method was evaluated using replicates and multivariate analysis. Alignment of 68 peak tables with a missing values limit of 10% returned 961 compounds. To evaluate the robustness of the data processing, HCA was carried out on the samples using the alignment table (Fig. S1 †). The clustering clearly separated the two sites, with only one COV sample (both instrumental replicates) clustering with CH samples. The GC Â GC chromatogram for this extract is presented in the bottom right of Fig. 1.
One CH sample clustered away from both sites' clusters; its chromatogram is presented in the top le of Fig. 1. In most cases, a sample nearest neighbour was its instrumental replicate. The ve extracts from the same soil sample (COV-O5 in Fig. S1 †) clustered altogether. Replicate runs before and aer the column change, however, did not cluster near each other (e.g. see COV-18 and COV-17 on Fig. S1 †). The data alignment was therefore deemed successful although it appeared somewhat susceptible to retention shi. Alignment with the missing values limit of 100% returned 58 compounds (Table S1 †). While we were able to ascertain during GC-MS quantication that all of the samples contained the 14 quantied PAHs, the alignment only returned seven of those. Similarly, while ve surrogates and one internal standard were added to each sample, only three were in the nal alignment table and their peak areas showed great variations. Recovery quanti-cation for the surrogates was carried out using GC-MS and showed good reproducibility (relative standard deviation around 10% for all surrogates) (data not shown). The errors in the GC Â GC data, therefore, came from the data processing. Sources of errors could be related to a degree of misalignment but were also expected to be related to the peak nding algorithm. Deconvolution and reconstitution of peaks across multiple modulation times might lead to two types of errors: (1) one peak can be incorrectly split into two peaks and (2) two or more peaks can be articially combined as one. The rst error is acknowledged in most data processing pipelines for GC Â GC data 11,17 as it occurs in metabolomics samples. Here we used the PrecompressFiles function of R2DGC to correct the peak tables for the rst error prior to alignment. The concentrations of surrogates, however, was very high compared to any other compound in the samples and their signals saturated the detectors, which increased the possibility of peak splitting and might explain the non-repeatability of their peak areas. The second error is less common in metabolomics as compounds have more distinct mass spectra. In hydrocarbon analysis, however, position isomers are very common, have very similar spectra and elute near each other and are therefore more likely to be integrated as a single peak. This can be minimised by optimising the parameters of data processing such as minimum peak widths in both dimensions and signal-to-noise ratio; it, however, remains an issue in samples with large dynamic ranges between compounds. The results therefore indicated that the probability of one of these errors to occur for any one compound in one of the 68 peak tables was high. Further optimisation of the data processing needs to be carried out for accurate and precise exhaustive characterisation of SVOCs.

Bacterial diversity in CH and COV soils
While 16 different samples were collected for the COV site and 9 for the CH sites, not all extracted DNA was successfully amplied. Additionally, some samples were extracted in replicates (see Fig. 2b) and variations in microbial communities between replicates was lower than the inter-sample variations.
Five measures of alpha diversity clearly showed that the CH samples had signicantly higher species richness than COV samples had (Fig. 2). While considering the top 25 most abundant orders, we can notice that the proportions of gammaproteobacteria, alphaproteobacteria and betaproteobacteria were consistent in the CH samples but varied in the COV samples. Comparison with the total concentrations of 14 PAHs suggested a possible connection between the levels of PAH contamination and the abundance of the dominant species in the samples (Fig. 2b), particularly the proportions of gammaproteobacteria. Two samples, however, showed contrary trends. One of them presented low total PAH concentration and a low proportion of gammaproteobacteria. In this sample, recovery values for the PAH surrogates were very low (between and 20 and 35%) indicating an issue during PLE extraction and therefore an underestimation of total PAH concentration.
NMDS (using Bray-Curtis distance) of sample dissimilarities showed separation between the two sites with more inter-sample variability in COV (Fig. 2c). PAH concentrations in COV were one order of magnitude higher on average than in CH samples and also presented a greater variance. This may explain the dispersion in the beta diversity space resulting in several ecological niches. The pathways analyses (not shown here) indicated that a signicantly higher proportion of OTUs in the COV samples were available in the reference SilvaMod 123 database (for which reference pathways are available in the Tax4Fun package) than in CH samples, which suggested that the OTUs that inhabited the more contaminated sites are more ubiquitous in nature and thus well characterised with their pathways available. The gene content for each OTU in each sample was inferred from the closest sequenced genome using the Tax4Fun package and we investigated particular pathways from the KEGG database linked to contaminant degradation 4 and related metabolic pathways. Out of the 12 tested pathways, seven presented signicantly different average relative abundance between COV and CH. COV presented statistically more potential for degradation of hydrocarbon contaminants via the toluene (ko00623) and naphthalene (ko00626) pathways and CH via the drug metabolism (other pathways) (ko00983) and PAH (ko00624) degradation pathways (Fig. 3). Incidentally, COV samples also showed signicantly more potential for the propanoate and pyruvate metabolisms and for glycolysis. Next, we explored the inuence of the environment on the assembly of microbial communities. We applied two phylogenetic alpha diversity indices, NTI and NRI (Fig. 2d), to explore phylogenetic clustering at both local and global scales. Positive values of NTI indicate that species co-occur with more closely related species than expected by chance, with negative values suggesting otherwise. This is mainly because NTI measures tip-level divergences (putting more emphasis on terminal clades and akin to "local" clustering) in phylogeny while NRI measures deeper divergences (akin to "global" clustering). For both NTI and NRI, values >+2 indicate strong environmental pressure, values <À2 indicate strong competition among species as the driver of community structure, and the values in between are a gradient between the two. The results indicated that the communities in CH samples are more deterministic, and inuenced by the environment, whereas COV sites are driven by the competitive exclusion principle where the species that can outcompete others in a given niche dominate, leading to more dispersion in the phylogenetic tree. This phenomenon also supports the much higher variability in beta diversity space for COV samples (Fig. 2c).

Discriminant analysis of microbiome and chemical data
Through discriminant analysis based on the microbiome alone (sPLS-DA), we found a total of 10 discriminating OTUs (ve for the rst two components each). PLS-DA (Fig. 4a) shows ordination of the samples using all of the OTUs and sPLS-DA (Fig. 4b) shows ordination of the samples using only the discriminating OTUs. By comparing these two gures, it can be seen that in the reduced space, the sample dissimilarities (between COV and CH) are conserved. Three out of the ten OTUs were greater in abundance for CH, while the rest were greater for COV. Clustering of the samples according to these OTUs showed that four out of the ve discriminating features from the rst component (OTUs 146, 1214, 384 and 333; this group is henceforth referred to as Group A) were signicantly more abundant in COV than in CH, and the last OTU (OTU_14) was signicantly more abundant in the CH site. OTUs 333, 384 and 1214 belong to the Clostridiales order. OTUs 333 and 384 are both plant biopolymer degraders of the Ruminococcaceae family, which might relate to the organic matter present on the site, while OTU_1214 belongs to the genus Bacillus, which is known for both PAH and alkane degradation. 18 OTU_146 belongs to the Desulfotomaculum genus, in which some species have been found to degrade cresol in sulfate-degrading condition 19 and OTU_4 belongs to the genus Sulfuritalea, oxygen independent aromatic compound degrading bacteria. 20 While none of the Group A OTUs are amongst the most abundant in the samples, OTU_14 is amongst the most abundant in the CH samples (results not shown). These 5 OTUs might be markers of the difference in hydrocarbon degradation mechanisms between the two sites because of differences in concentration and in soil properties and quality.
A rst DIABLO analysis (hitherto referred to as DIABLO 1) (Fig. 5) was carried out by integrating additional metadata to the microbiome data: the concentrations of 14 PAHs, LOI, moisture, heavy metal concentrations and qPCR data for Gram negative PAH degrader (PAH RHD a GN), Gram positive PAH degrader (PAH RHD a GP) and alkane degrader (alkB) gene abundance (Table 1). We found two components to reduce the classication error rates resulting in 10 discriminating OTUs and 10 discriminating meta data features (ve each for the two components). Group A represented four of the ve OTUs that were selected in the rst component (Table 2). This conrmed that in both cases, the rst component isolated the OTUs that were the most representative of the differences between COV and CH and particularly the ones that were much more abundant in COV than in CH. The nal OTU, OTU_147, was identied as the actinobacteria, Micromonospora sp. WMMB 894, for which little information is available in the literature. These ve OTUs are further referred to as Group A 0 .
Consequently, this overlap between the rst component OTUs for the sPLS-DA and DIABLO 1 allowed us to interpret the discriminating metadata selected for the rst component in DIABLO 1 as the ones that were the most representative of the differences between CH and COV. These included moisture, naphthalene and dibenzo(a,h)anthracene, cobalt and iron. Given that moisture content between the two sites is signicantly differentthe average moisture in COV samples was 19.5% AE 0.8 (95% condence interval) and the CH average moisture was 8.1% AE 1.8 (95% condence interval)it is likely to be a strong driver for bacterial community composition. Naphthalene and dibenzo(a,h)anthracene were not the PAHs with the highest concentrations in the samples but were selected   Table 2), with both rows and columns ordered using hierarchical (average linkage) clustering to identify blocks of features of interest. Group A (see Table 2) is indicated in a grey box.

Paper Faraday Discussions
This   Table 2), with both rows and columns ordered using hierarchical (average linkage) clustering to identify blocks of features of interest. Group A 0 (Table 1) is indicated in a grey box. (c) shows the significant correlations (À0.6 < R > 0.6) between the features as calculated by the algorithm.      statistically to represent low molecular weight and high molecular weight PAHs, respectively, as markers of the difference in PAH concentration between the two sites. Similarly, the heavy metal concentrations were much higher in COV samples than in CH samples and this is reected in the presence of iron and cobalt as signicant contributors to the rst component. Noteworthy was the fact that D8-naphthalene was the surrogate with the lowest repeatability and therefore, although naphthalene in an aged contaminated soil is less likely to evaporate than a freshly spiked surrogate during extraction, the concentration of naphthalene was likely to be underestimated in our sample and this should be addressed in the future considering its importance in the analysis. A second DIABLO analysis (DIABLO 2) (Fig. 6) was carried out introducing to the previous dataset, the 58 compounds that were found in all samples by alignment of the GC Â GC data. For the rst component, once again Group A 0 was selected as discriminant features along with the same metadata features. Twodimensional hierarchical clustering (2D-HCA) of the samples and variables for the rst and the second components in all three cases (Fig. 4c, 5b and 6b) presented ideal site separation of the samples and the Group A 0 OTUs and the metadata that got selected for the rst component drove this clustering.
Pairwise correlations ( Fig. 5c and 6c) showed that the discriminating features from the rst component (in both DIABLO 1 and DIABLO 2) were also highly correlated to each other, demonstrating the link between the microbiome and the stressors most responsible for the difference between these sites.
Since all three discriminatory analyses above rank the components based on percentage variability explained by the components, the features selected in lower components (i.e. the second component) serve as a cue to elucidate ner differences between samples as opposed to the rst component.
In the sPLS-DA analysis, while two of the OTUs amongst the ve OTUs selected for the second component (Table 2) are clearly more abundant in one site than the others (OTU_3 in CH and OTU_650 in COV), the three others have more nuanced distribution. OTUs 683 and 498 are more abundant in COV overall but are also present in high abundance in some CH samples, while OTU_164 is more abundant in CH on average but is abundant in some COV samples. These three OTUs that are signicantly present in both samples might be indicative of change of principal function of the communities as the hydrocarbon distribution changes. OTU_650 was identied as belonging to the genus of Stenotrophomonas (gammaproteobacteria), which have been found to be PAH degraders. 21 OTU_3 belongs to the genus Sandaracinobacter, a genus of aerobic anoxygenic phototrophic extremophiles. 22 OTU_683 is a member of the Methylobacteriaceae family, some strains of which are known to exhibit high tolerance to heavy metal contamination and have been used benecially in the bioremediation of contaminated environment. 23,24 OTU_683 was the only OTU from the second component of the sPLS-DA to also be selected for the second component in DIABLO 1. Three other OTUs (124, 15 and 448) selected for the second component in DIABLO 1 were signicantly and positively correlated with each other and with the (log 10 ) qPCR results for Gramnegative PAH and alkane degraders. The same three OTUs correlated negatively with OTU_683 and with cadmium (Fig. 5c). The abundance of the aforementioned OTUs seems to drive the clustering within the COV samples in the 2D-HCA, demonstrating two possible regimes of hydrocarbon degradation, which, however, do not appear to be related to the concentration of PAHs. OTUs 124, 15 and 448 all belong to known aromatic hydrocarbon degrading genera of Achromobacter, Pseudoxanthomonas and Sinorhizobium. [25][26][27][28][29] The abundance of these OTUs was low in samples from COV where cadmium concentration was high, which appeared to favour OTU_683. In the DIABLO 2 analysis, the discriminating features of the second component were entirely different than those of DIABLO 1.
The ve OTUs formed two distinct groups negatively correlated to each other: OTUs 635 and 344 lying in one group and OTUs 116, 255 and 1397 lying in the other group. The latter all belong to the Sphingomonadales and were also negatively correlated to the discriminating metadata features: four long chain branched alkylbenzenes and 2,4-di-tert-butylphenol (Fig. 6c). The abundance of these chemicals also appeared to upregulate the abundance of OTUs 635 and 344. OTU_344 was identied as belonging to the Extensimonas genus, aerobic chemoorganotrophs that have been isolated before from wastewater. 30 These highlighted again two different regimes in the COV samples, depending this time on the abundance of hydrocarbons that were not PAHs. Samples in COV with high abundance of these hydrocarbons clustered closer to the CH samples (in which the abundances were high too) than to other COV samples. The integration of the GC Â GC data does not affect the rst component, demonstrating that the differences in moisture level, heavy metals and PAH concentrations explains the differences between the two sites more than the differences in the exhaustive SVOC signatures. It highlighted, however, through the second component, the inuence on the autochthonous microbiome of compounds that would not have normally been measured or taken in consideration during site investigation.

Conclusions
Our results demonstrated the usefulness of "multi-omics" approaches in the context of contaminated soils to correlate the abundance of chemical contaminants to the abundance of microbial OTUs. At the same time, utilising the ecological principles, we have highlighted the deterministic nature of microbial communities, dependent on the presence of chemical contaminants. Whilst the discriminant analysis approaches, sPLS-DA and DIABLO, adopted in this study give a reduction of large microbial feature space to the subset of species that form an association with the chemical contaminants and other meta data, care must be taken to interpret causality, primarily because we have only considered sites with spatial variabilities and the patterns found were predominantly inter-site discriminants. It was already successful, however, in demonstrating the efficacy of qPCR analysis, where causality is already known, as a rapid screening tool for hydrocarbon biodegradation potential on a site. This preliminary study has been useful in rstly validating our experimental methods such as exhaustive SVOC and DNA extraction from highly contaminated soil samples and evaluating our in silico pipelines for data processing, highlighting notably the potential of the R2DGC free package for alignment of GC Â GC peak tables but also its limitations and those of peak picking algorithms, which will need improving for true comprehensive studies. To give the mechanistic underpinnings, a thorough exploration such as in the context of time series microcosms is required as well as subsequent carefully designed experiments that are informed by the ndings of this study. Furthermore, the metabolic potential of the microbial community in this study is explored through a proxy method (Tax4Fun) dependent on the availability of the reference pathway database. Although approaches such as shotgun metagenomics give the actual metabolic potential, we have demonstrated that with high representation of taxa from contaminated soils in the reference database, Tax4Fun offers a cost-effective solution with reasonable resolution for biodegradation pathways. Thus, for contaminated sites, coupling microbial community surveys using 16S rRNA with GC Â GC MS offers a whole that is greater than the sum of its parts.

Conflicts of interest
There are no conicts to declare.