Hae-Won
Uh
*^{a},
Lucija
Klarić
^{bc},
Ivo
Ugrina
^{bde},
Gordan
Lauc
^{bf},
Age K.
Smilde
^{g} and
Jeanine J.
Houwing-Duistermaat
^{ah}
^{a}Department of Biostatistics and Research Support, University Medical Center Utrecht, Utrecht, Netherlands. E-mail: h.w.uh@umcutrecht.nl
^{b}Genos Glycoscience Research Laboratory, Zagreb, Croatia
^{c}MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh, UK
^{d}University of Split, Faculty of Science, Split, Croatia
^{e}Intellomics Ltd, Croatia
^{f}Faculty of Pharmacy and Biochemistry, University of Zagreb, Zagreb, Croatia
^{g}Biosystems Data Analysis, Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, The Netherlands
^{h}Department of Statistics, University of Leeds, Leeds, UK
First published on 10th March 2020
Rapid progress in high-throughput glycomics analysis enables the researchers to conduct large sample studies. Typically, the between-subject differences in total abundance of raw glycomics data are very large, and it is necessary to reduce the differences, making measurements comparable across samples. Essentially there are two ways to approach this issue: row-wise and column-wise normalization. In glycomics, the differences per subject are usually forced to be exactly zero, by scaling each sample having the sum of all glycan intensities equal to 100%. This total area (row-wise) normalization (TA) results in so-called compositional data, rendering many standard multivariate statistical methods inappropriate or inapplicable. Ignoring the compositional nature of the data, moreover, may lead to spurious results. Alternatively, a log-transformation to the raw data can be performed prior to column-wise normalization and implementing standard statistical tools. Until now, there is no clear consensus on the appropriate normalization method applied to glycomics data. Nor is systematic investigation of impact of TA on downstream analysis available to justify the choice of TA. Our motivation lies in efficient variable selection to identify glycan biomarkers with regard to accurate prediction as well as interpretability of the model chosen. Via extensive simulations we investigate how different normalization methods affect the performance of variable selection, and compare their performance. We also address the effect of various types of measurement error in glycans: additive, multiplicative and two-component error. We show that when sample-wise differences are not large row-wise normalization (like TA) can have deleterious effects on variable selection and prediction.
Fig. 2 The flow diagram identifying the steps involved in a preparation of data for statistical analysis. |
In general, between-subject differences in total abundance of raw glycomics data are large, and even technical replicates show substantial differences due to measurement errors. To reduce these differences (or variances), applying a log-transformation to the cleaned raw data is by far the easiest way, as in other omics data analysis. Traditionally, however, the difference in total abundance per subject is forced to be exactly zero by scaling each sample to have the sum of all glycan intensities equal to 1 (or 100%). This results in so-called compositional data, also present in microbiome data analysis. Until now, there is no clear consensus on the appropriate normalization method for glycomics data.
The compositional nature of the data renders many standard multivariate statistical methods inappropriate or inapplicable, and the complications from such data are well recognized in the statistics literature.^{8,9} For example, one cannot simply calculate the Pearson's correlations between two compositional components. To illustrate this problem, consider a simple, toy example of only two glycans representing a whole glycome of a person. Because of the constraint to the sum of a 100%, when the level of one glycan increases, the level of another must decrease. These two glycans are therefore negatively correlated.^{10} Such changes in correlation structure are shown in Fig. 3. In Table 1, we divide the six normalization methods considered here into two classes: row-wise (making samples comparable, Fig. 3(c)–(f)), and column-wise (making glycans comparable, Fig. 3(g) and (h)). Moreover, as can be seen in Fig. 3(a), glycans are highly correlated. To avoid overfitting and the multicollinearity problem, penalized regression can be considered. The effects of column-wise normalization such as centering and scaling will lead to a corresponding change in the scale of the coefficients and standard errors, but no change in the significance or interpretation. In contrast, if the compositional glycan data is used as covariates in regression analysis, the lasso regression should not be directly applied.^{11}
Class | Normalization | Abbreviation | Description |
---|---|---|---|
a Aitchison^{12} introduced transformations based on ratios: the additive log-ratio transformation (ALR) and the centred log-ratio transformation (CLR). b By taking the logarithm of RP, the additive log-ratio (ALR) transformation is obtained. | |||
Row-wise | Total area | TA | Each glycan peak is divided by total abundance per subject, resulting compositional data. |
Log-transform of TA | logTA | To achieve less-skewed distribution. When centering is applied to each sample, it is equivalent to the centred log-ratio (CLR) transformation.^{a} | |
Reference peak^{b} | RP | Each glycan peak is divided by the most abundant glycan peak per subject. | |
Median quotient | MQ | To remove the bias due to the relative abundance of glycan intensities. | |
Column-wise | Median scaling | MS | Each glycan peak is subtracted by its median and divided by the interquartile range (IQR). |
Multivariate quantile normalization | MQN | Column-wise adaptation of quantile normalization in gene expression data. |
Another important issue to be addressed is measurement error in glycans. Compared to more abundant glycans, IgG glycans of low-abundance appear to be measured with up to 50% of measurement error, indicating an additive error (unpublished data). In gene expression data, where measurement error is approximately constant over a range of intensity levels near zero, it appears to be proportional to intensity level at large intensity levels,^{13,14} which might imply multiplicative error. Since the specific measurement error structure in the real data is not well studied, we consider three models: additive, multiplicative, and two-component measurement error in glycan measurements, the last containing both additive and multiplicative error components. We study the effects caused by measurement error in our simulation studies. Given that the real correlation structure between glycans is not known, to assess the influence of different normalizations on discovery of glycan biomarkers we simulated the data to mimic the real data and simulated association with an outcome. In short, we first generate the glycans that mimic the correlation structure of real glycan data, namely the log-transformed glycan measurements from the Orkney Complex Disease Study (ORCADES).^{15} Next, outcome variables are generated assuming that different sets of glycan combinations are associated with the outcome. In addition, based on different error models the corresponding error-contaminated datasets are generated. For each of simulated glycan datasets (with or without error) six different normalization methods are applied, and variable selection is employed. Finally, the impact of normalization on the performance of variable selection is assessed via extensive simulations reflecting various scenarios as described in Section 3.
The structure of the paper is as follows. In the next section, a detailed description of the considered normalization methods is given, followed by discussion on measurement error in glycomics data. In Section 3 we present our simulation study and evaluate the robustness and efficiency of the normalization methods regarding variable selection. Next, the analysis of glycan measurements from the ORCADES study^{15} is considered in Section 4, where glycan variable selection is performed with age as an outcome. Finally, in the last section we give some insights and recommendations based on the conducted simulation study.
logTA(x_{ij}) = log(TA(x_{ij})). |
RP(x_{ij}) = x_{ij}/x_{is}. |
MQ(x_{ij}) = x^{q}_{ij}/median(x^{mq}_{i}). |
MS(x_{ij}) = {x_{ij} − median(x_{j})}/IQR(x_{j}). |
(_{1},…,_{p}) = (x_{1},…,x_{p})/(x_{1} +⋯+ x_{p}), |
Cov(_{1}, _{1} +⋯+ x_{p}) = 0. |
Cov(_{1}, _{2}) +⋯+ Cov(_{1}, _{p}) = −Var(_{1}). |
Assuming X has a multivariate normal distribution with mean μ and covariance matrix Σ, we consider a regression of a response Y on a predictor or covariate X. Instead of observing X, we observe W: i.e., error-free data (Y, X) versus the error-contaminated data (Y, W). First, for additive error we have W_{A} = X + U_{1}, where U_{1} is additive error independent of X. U_{1} is normally distributed and has mean zero and covariance matrix Σ_{U1}. For simple linear regression the effect of having additive measurement error in a covariate is said to be an underestimate of the coefficient, known as the attenuation bias. The effects can vary depending on simple or multiple regression, and whether a covariate measured with error is univariate or multivariate. For our simulation study, we consider a diagonal matrix for uncorrelated error as well as a full matrix for correlated errors. For the multiplicative measurement error, we have W_{M} = XU_{2}, where U_{2} is multiplicative error. It indicates that the largest observed values are very far from the true values. If and U_{2,j} = exp(U_{j}), for j = 1,…,p, then U_{2} has a multivariate log-normal distribution with mean zero and covariance matrix Σ_{U}. Lastly, the two-component model or Rocke–Lorenzato model^{13,22} containing both additive and multiplicative error is as follows: W_{MA} = XU_{2} + U_{1}, where U_{1} and U_{2} are independent errors. In the univariate case, w = xσ_{2}^{2} + σ_{1}^{2}, and this implies Var(w|x) = x^{2}σ_{2}^{2} + σ_{1}^{2}. For sufficiently small values of x, Var(w|x) is similar to σ_{1}^{2}, while for sufficiently large values of x, Var(w|x) is similar to σ_{2}^{2}.^{21} This behaviour in the multivariate case will be studied using simulations.
y = Xβ + ε, | (1) |
Among many methods, to achieve good prediction, to avoid overfitting, and to obtain an interpretable model, we consider the Least Absolute Shrinkage and Selection Operator (lasso) proposed by Tibshirani.^{23} This method considers both continuous shrinkage and variable selection. The lasso is a penalized least squares procedure that minimizes RSS = (ỹ − Xβ)^{T}(ỹ − Xβ) where ỹ = y − y1_{n}, subject to the non-differentiable constraint expressed in terms of the L_{1} norm. The lasso estimator is given by
(2) |
To assess how the results of different scenarios will generalize to external datasets, we investigate performance of predictive model through estimating accuracy of a predictive model applied to a new independent data. We therefore first build the model (or perform variable selection) using a training and test dataset, and then validate the model composed of the selected variables using an external validation dataset. To summarize the results, we report the numbers of correctly and incorrectly selected variables, and we quantify the prediction error,^{24} defined by squared root of the average error in the prediction of y given X for future cases not used in the construction of a prediction equation. Formally, if (X) is the predicted values constructed using the present data, the prediction error can be written as
(3) |
(ii) Generate the error-contaminated glycan data W. We simulate 1000 datasets consisting of 2n observations and p glycans generated with three different error models – additive, multiplicative, two-component. The parameters of the covariance matrix Σ are shown in Table 2. The diagonal elements are denoted as σ_{ii} = σ_{i}^{2}, and the off-diagonal elements as σ_{ij}(i ≠ j). We consider both correlated and uncorrelated covariance matrices.
Error type | Additive | Multiplicative | |||
---|---|---|---|---|---|
σ _{ ii } | σ _{ ij } | σ _{ ii } | σ _{ ij } | ||
Correlated | E1 | 1/4 | 1/8 | 0.1 | 0.05 |
E2 | 1 | 0.5 | 0.01 | 0.005 | |
Uncorrelated | E1_{un} | 1/4 | 0 | 0.1 | 0 |
E2_{un} | 1 | 0 | 0.01 | 0 |
(iii) Simulation of trait (y). Based on linear regression y = Xβ + ε (eqn (1)) with , and the coefficients β were set as follows:
• 3 coefficients fixed: β_{j∈{1,2,10}} = (1,0.5,2)^{T} and all other β_{j{1,2,10}} = 0.
• 6 coefficients fixed: β_{j∈{1,2,10,15,16,17}} = c(1,0.5,2,3,0.5,1)^{T} and all others are set to zero.
The descriptives of y are given in Table S1 (ESI†).
(iv) Six normalization methods. For each of the error-free and -contaminated glycan datasets – glycan data generated from the steps (i) and (ii) – the TA, logTA, RP, MS, MQ, or MQN transformation is applied.
(v) Regression. Using n training sets we apply lasso penalized regression for variable selection and prediction. The effect estimates are compared to the true β.
(vi) Prediction. Using n (independent) validation or test sets, estimates from the variable selection (step v) are plugged in the model, and the fitted outcome ŷ are compared to the true y.
(vii) To assess the performance of variable selection, the average number of the correctly and incorrectly (falsely) selected variables is computed. For prediction performance the root mean squared prediction error (eqn (3)) is computed.
Error model | 3 beta's^{a} | 6 beta's^{b} | |||||
---|---|---|---|---|---|---|---|
Nr correct^{c} | Nr false^{d} | PE^{e} | Nr correct | Nr false | PE | ||
a The glycans 1, 2, and 10 were assumed to have non-zero effects, and all other 21 glycans no effect. b The glycans 1, 2, 10, 15, 16 and 17 were assumed to have non-zero effects, and all other 18 glycans no effect. c The average number of correctly selected glycans. d The average number of falsely selected glycans, which should be close to zero. e The root mean squared error of prediction with respect to the new observations. f The rows in italics show the results of simulated data without additional error in glycans, which can be interpreted as the best achievable results under the correct models. | |||||||
Error-free | SIMdata ^{ } | 3.00 | 4.46 | 1.00 | 5.94 | 5.12 | 1.01 |
MS | 3.00 | 4.47 | 1.00 | 5.94 | 5.10 | 1.01 | |
MQN | 3.00 | 4.52 | 1.00 | 5.94 | 5.29 | 1.01 | |
TA | 2.83 | 17.96 | 1.88 | 5.18 | 15.7 | 3.75 | |
logTA | 2.97 | 19.86 | 1.83 | 5.75 | 17.2 | 3.62 | |
RP | 1.83 | 13.21 | 2.33 | 5.73 | 17.13 | 3.63 | |
MQ | 1.88 | 12.99 | 2.33 | 5.71 | 16.19 | 3.62 | |
Additive | SIMdata | 3.00 | 13.57 | 1.50 | 5.38 | 15.31 | 2.69 |
MS | 3.00 | 13.63 | 1.50 | 5.37 | 15.32 | 2.69 | |
MQN | 3.00 | 13.58 | 1.50 | 5.37 | 15.29 | 2.70 | |
TA | 2.42 | 16.46 | 2.18 | 4.91 | 14.63 | 4.57 | |
logTA | 2.61 | 17.19 | 2.16 | 5.24 | 15.47 | 4.50 | |
RP | 2.61 | 17.25 | 2.16 | 5.18 | 15.82 | 4.50 | |
MQ | 2.70 | 16.98 | 2.16 | 5.44 | 15.56 | 4.50 | |
Multiplicative | SIMdata | 2.12 | 8.90 | 2.29 | 3.03 | 8.50 | 4.90 |
MS | 2.14 | 8.89 | 2.29 | 3.02 | 8.46 | 4.90 | |
MQN | 2.12 | 8.76 | 2.29 | 3.03 | 8.46 | 4.90 | |
TA | 0.29 | 1.68 | 2.36 | 0.45 | 1.32 | 5.06 | |
logTA | 0.61 | 6.50 | 2.35 | 1.62 | 6.15 | 5.02 | |
RP | 0.64 | 4.27 | 2.36 | 1.54 | 4.36 | 5.05 | |
MQ | 0.54 | 4.14 | 2.36 | 1.46 | 4.09 | 5.05 | |
Two-component | SIMdata | 2.06 | 9.07 | 2.30 | 3.01 | 8.30 | 4.89 |
MS | 2.06 | 9.09 | 2.30 | 3.02 | 8.40 | 4.89 | |
MQN | 2.07 | 9.13 | 2.30 | 3.00 | 8.29 | 4.89 | |
TA | 0.23 | 1.66 | 2.36 | 0.49 | 1.38 | 5.05 | |
logTA | 0.63 | 6.68 | 2.35 | 1.63 | 5.93 | 5.02 | |
RP | 0.64 | 4.37 | 2.36 | 1.59 | 4.19 | 5.04 | |
MQ | 0.55 | 4.26 | 2.36 | 1.57 | 4.27 | 5.05 |
Error model | 3 beta's^{a} | 6 beta's^{b} | |||||
---|---|---|---|---|---|---|---|
Nr correct^{c} | Nr false^{d} | PE^{e} | Nr correct | Nr false | PE | ||
a The glycans 1, 2, and 10 were assumed to have non-zero effects, and all other 21 glycans no effect. b The glycans 1, 2, 10, 15, 16 and 17 were assumed to have non-zero effects, and all other 18 glycans no effect. c The average number of correctly selected glycans. d The average number of falsely selected glycans, which should be close to zero. e The root mean squared error of prediction with respect to the new observations. f The rows in italics show the results of simulated data without additional error in glycans, which can be interpreted as the best achievable results under the correct models. | |||||||
Additive | Simdata ^{ } | 2.74 | 13.91 | 1.92 | 5.26 | 15.02 | 3.87 |
TA | 1.82 | 14.00 | 2.31 | 3.82 | 11.57 | 4.93 | |
logTA | 2.26 | 15.68 | 2.29 | 4.62 | 13.86 | 4.86 | |
RP | 2.34 | 15.81 | 2.29 | 4.84 | 4.94 | 4.85 | |
MQ | 2.45 | 15.48 | 2.29 | 5.01 | 14.77 | 4.85 | |
Multiplicative | Simdata | 2.65 | 16.06 | 1.87 | 5.07 | 15.05 | 3.83 |
TA | 1.83 | 13.80 | 2.32 | 4.00 | 11.75 | 4.94 | |
logTA | 2.00 | 16.10 | 2.28 | 4.51 | 14.63 | 4.84 | |
RP | 2.22 | 15.50 | 2.29 | 4.65 | 14.01 | 4.86 | |
MQ | 2.30 | 15.28 | 2.29 | 4.96 | 13.96 | 4.86 | |
Two-component | Simdata | 2.67 | 12.25 | 2.10 | 4.32 | 10.99 | 4.37 |
TA | 1.08 | 8.77 | 2.35 | 2.49 | 7.58 | 5.01 | |
logTA | 1.66 | 13.25 | 2.33 | 3.64 | 11.64 | 4.94 | |
RP | 1.81 | 12.35 | 2.10 | 4.05 | 11.87 | 4.96 | |
MQ | 1.84 | 12.24 | 2.10 | 4.36 | 12.25 | 4.96 |
The first row of each error model in Table 3, SIMdata, serves as a reference point and can be interpreted as the best achievable results under the correct model. Even when glycans were simulated without additional measurement error, the number of the falsely selected variables was non-zero: 4.46 for 3 fixed coefficients, and 5.12 for 6 fixed coefficients in average. When additive error in the data, the number of the falsely selected (13.57 and 15.31, respectively) becomes large, indicating a poor performance of variable selection. When multiplicative error was introduced the number of both correctly and incorrectly selected variables decreased and prediction error increased. The results of two-component error were close to those of multiplicative error, which indicates that in this parameter setting multiplicative error dominates additive one. Among row-wise normalization methods, TA and logTA perform poorly for variable selection when multiplicative error was involved. For multiplicative and two-component error the performance of all normalization methods was comparable; variable selection and prediction became exceedingly difficult. Overall, the column-wise normalizations outperformed the row-wise normalizations. Moreover, the results from two column-wise normalization, MS and MQN, were similar to SIMdata under all error models of Table 2.
Based on the error model E2 (Table 4), in which multiplicative errors have very small variances compared to additive ones, the number of incorrectly selected variables for the benchmark, SIMdata, was very large across all error types, leading to a poor variable selection performance. Prediction error seemed more controllable than under E1.
Fig. 5 highlights the prediction performance. Since the results of the models including 3 and 6 β's were similar, only the performance based on the models with 6 β's under E1 and E2 is shown. In case of uncorrelated error, compared to Table 3, the effects of the error in glycans became more marked (data not shown). Comparison of the correlated and uncorrelated error based on 6 βs in the model and the two-component error model is depicted in Fig. S1 (ESI†). The correlated error caused a poorer performance in variable selection.
To summarize the simulation results, even without additional measurement error the row-wise normalization appeared to perform poorly based on simulated glycan data, in terms of false positives and accurate prediction. Adding error caused perturbation of correlation structure, and especially introducing multiplicative error resulted in smaller number of variables in the model (less correctly- and incorrectly selected variables) and larger prediction error.
Glycan traits bound to immunoglobulin G (IgG) (Fig. 1) were measured by ultra-performance liquid chromatography (UPLC) in 2035 individuals from the ORCADES study, as described in Kristic et al.^{2} As can be seen in Fig. 1, 24 different glycan peaks are quantified using UPLC, with each glycan peak containing one or more glycan structures. Abundance of individual glycan structures in every glycan peak can be found in Pucic, et al.^{25} Participants were men and women (797; 1238) aged between 17 and 100 years (median age, 54). To explore effects of different normalization on biomarker selection of age, the glycan abundance was transformed using six normalization methods shown in Table 1, and additional logRP (log-transformation of RP). To remove strong batch effects in the glycomics data, following different normalization, we performed batch correction using empirical Bayes method^{26} as implemented in the ComBat function of sva package for R.^{27} To determine the ‘best’ model, applying the lasso requires selecting a value for the tuning parameter λ in eqn (2). For this training and testing of the model, ten-fold cross-validation was applied to the lasso fits in two third of the dataset. Then, the selected model was re-fitted for predicting age to the remaining one-third of the data (validation set). For comparison of the performance of each normalization, variable selection (in terms of non-zero coefficients) and accuracy in prediction are presented in Table 5.
Glycan | Row-wise | Column-wise | |||||
---|---|---|---|---|---|---|---|
TA | logTA | RP | logRP^{a} | MQ | MS | MQN | |
Effect sizes (coefficients) | |||||||
a Log-transformation of RP is included, which is equivalent to the ALR transformation.^{12} b The number of non-zero coefficients (selected variables). c As defined in eqn (3). The smaller, the better performance. | |||||||
GP1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
GP2 | 0.00 | 1.36 | 32.53 | 12.27 | 2.26 | 2.28 | 2.82 |
GP3 | 0.00 | 0.00 | 0.00 | 0.00 | 0.55 | 0.00 | 0.00 |
GP4 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
GP5 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
GP6 | 2.06 | 12.05 | 123.14 | 129.56 | 9.97 | 13.33 | 14.14 |
GP7 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
GP8 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
GP9 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
GP10 | 0.63 | 2.55 | 90.74 | 0.00 | 8.14 | 0.00 | 1.43 |
GP11 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
GP12 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
GP13 | 0.00 | 0.00 | −58.53 | 0.00 | −7.38 | 0.00 | 0.00 |
GP14 | −1.72 | −20.65 | −259.85 | −212.61 | −14.76 | −16.56 | −20.00 |
GP15 | 0.00 | 0.00 | 0.00 | 0.00 | −2.76 | 0.00 | 0.00 |
GP16 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
GP17 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
GP18 | 0.00 | 0.00 | 0.00 | 0.00 | −0.63 | 0.00 | 0.00 |
GP19 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
GP20 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
GP21 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
GP22 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
GP23 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
GP24 | 0.00 | 0.48 | 9.39 | 0.00 | 0.00 | 1.41 | 2.14 |
Variable selection^{b} | 3 | 5 | 6 | 3 | 8 | 4 | 5 |
Prediction error^{c} | 10.52 | 10.40 | 10.45 | 10.83 | 10.46 | 10.40 | 10.36 |
Here, the results of variable selection is presented as (bold-faced) non-zero coefficients (effect size). For example, if you use TA normalization, 3 glycans (GP6, GP10, and GP14) will be selected as potential biomarkers with prediction error 10.52. In contrast, the smallest prediction error (10.36) is obtained using MQN, which selects 5 glycans (GP2, GP6, GP10, GP14, and GP24). GP6 and GP14 are the stable associated glycans, which are selected using every normalization methods. RP and log RP selected 6 and 3 glycans, respectively. However, the absolute effect size of the ‘stable’ glycans (GP6 and GP14) is ca. 10-fold greater than that of other normalizations, indicating biased results. Less robust (with smaller effect size), but evidence of association of GP2, GP10, and GP24, can be found using the majority of the normalization methods (represented by the bold-faced glycans in the first column).
To summarize, with regard to the accuracy of prediction, MQN showed the smallest prediction error, followed by logTA and MS. In terms of variable selection, MQ selected the largest number of the variables, thereby failing to provide sparse model. In particular, the strong effect size of GP14 appears to influence the selection of neighbouring GP13 and GP15. The other row-wise normalization, TA and RP (including logRP), results in seemingly ‘out-of-range’ values, which lacks model interpretability. Based on the magnitude of the effect sizes as well as variable selection, logTA (row-wise) and MS and MQN (both column-wise) seem to agree on the selected glycans.
In this work we also addressed the effect of possible measurement error present in glycans. In general, introducing error caused decrease in the strength of correlations, and in particular multiplicative error produced more perturbation in correlation structure. In case of two-component error, the tendency of decreasing correlation was more pronounced. Moreover, even with very small multiplicative error, multiplicative error dominated additive one. Our simulation study clearly indicates that understanding measurement error structure is crucial, not only to extract real signals from noise-ridden data, but also to improve accuracy and efficiency of variable selection and prediction for finding glycan biomarkers. In statistics, measurement error models or errors-in-variables models are regression models that account for measurement errors in independent variables.^{21} Here, given that the real-life data measurement error is not well studied, we did not consider how to correct for the attenuation bias in multivariate errors-in-variables regression, but applied measurement error model to generate datasets contaminated with various types of error. We also included here an arbitrary correlated error structure, assuming that the highly correlated glycan data will have correlated errors. Nevertheless, replicates are more and more used in quantification of glycans and it will soon be possible to estimate the correlated measurement error for glycan data, which will greatly improve data pre-processing of glycan measurements.
Via simulation we have shown the row-wise normalization methods can have deleterious effects on variable selection based on multiple regression with the glycans covariates. When applied to the real data, two column-wise (MS and MQN) and logTA gave similar results in terms of variable selection, prediction, and the magnitude of effect estimates. For argument's sake, assuming the glycomics data resembles the microbiome count data and is compositional, what are the possibilities to obtain valid results? Although compositional data are proven difficult to handle statistically – the covariance matrix is not positive-semi-definite (or singular) and the level of the variance depends on the mean of the distribution – statistical methods for such data have been developed. For principal component analysis, Aitchison proposed a log linear contrast form to deal with compositional data,^{29} Regarding the variable selection problem, Lin et al.^{30} addressed this in high-dimensional regression with compositional covariates, motivated by research problems arising in the analysis of gut microbiome and metagenomic data. They considered the linear log-contrast model of Aitchison and Bacon-Shone.^{31} Whether any of these approaches would be beneficial for glycomics data analysis is yet to be determined. In particular, for network analysis where the analysis is based on correlation structure, it is not at all clear how to perform analysis based on the TA normalized data.^{32}
With regard to biomarker discovery, it might be a better strategy to analyse glycans jointly. New groups of a few glycans, called derived traits, can be constructed, which represent groups of glycan structures that have similar structural and chemical properties. Up until now, these derived traits often exhibited stronger associations with studied outcome. Alternatively, we investigated here well-established variable selection and prediction method to discover multiple glycans that might be jointly responsible for association with disease trait. As shown in application to the real data (Table 5), the glycan biomarkers with large effect were selected, regardless which method was applied. To identify glycans with smaller effect and to avoid false positive and biased results, a few normalization methods (such as logTA, MS and MQN) can be employed to check the robustness of variable selection. Our simulation study clearly demonstrated that incorrect pre-processing steps might hamper discovery of reliable biomarkers. Another non-ignorable issue emerged from simulation study was the large amount of false positives (or falsely selected), due to the highly correlated nature of glycan measurements. In various scenarios many false positives were found, and therefore the task of variable selection failed. Hence, robustness of glycan biomarkers is of utmost importance for further investigation.
Regardless of the correlation structure or glycan measurement technology used, row-wise normalizations introduce spurious correlations and can therefore have an effect on downstream statistical analyses. The specific effects of other normalization methods on biomarker discovery not assessed in this paper should be studied before being implemented in measurement technology specific preprocessing procedures.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c9mo00174c |
This journal is © The Royal Society of Chemistry 2020 |