Prediction of microRNA-disease associations with a Kronecker kernel matrix dimension reduction model †

Identifying the associations between human diseases and microRNAs is key to understanding pathogenicity mechanisms and important for uncovering novel prognostic markers. To date, a series of computational approaches have been developed for the prediction of disease-microRNA associations. However, these methods still struggle to perform satisfactorily for diseases with only a few known associated microRNAs. This study introduces a novel computational model, namely, the Kronecker kernel matrix dimension reduction (KMDR) model, for identifying potential microRNA-disease associations. This model combines the microRNA space and the disease space into a larger microRNA-disease space by using the Kronecker product or the Kronecker sum. The predictive performance of our proposed approach was evaluated and validated on known association datasets. The experimental results show that KMDR achieves reliable prediction with an average AUC of 0.8320 for 22 complex diseases, which outperforms other competitive methods. Moreover, case studies on kidney cancer, breast cancer, and esophageal cancer further demonstrate the applicability of our method in the identification of new disease-microRNA pairs. The source code of KMDR is freely available at https://github.com/ghli16/KMDR.


Introduction
MicroRNAs (miRNAs), which are ~22 nucleotides in length, are a special class of small non-coding RNAs that repress translation or cause degradation of their target mRNAs during posttranscriptional regulation. 1 According to the literature, miRNAs are involved in multiple biological or cellular processes, such as cell development, 2 differentiation, 3 metabolism, 1 and apoptosis. 4 In addition, emerging evidence has indicated that functional disruption of miRNA is associated with diverse complex human diseases, including cancer. [5][6][7][8] Therefore, predicting disease-associated miRNAs is crucial for elucidating mechanisms of pathogenicity and discovering novel drug targets. However, validating miRNA-disease associations by biomedical experiments is costly and time-consuming. Given that a large number of miRNA association datasets have become available, it is necessary to design computational methods to reveal new types of disease-related miRNAs with high accuracy.
Based on the principle that functionally related miRNA molecules are likely to be regulated in phenotypically similar diseases, a number of computational tools have been put forward to uncover latent links between diseases and miRNAs. [9][10][11][12][13] For instance, Jiang et al. 14 predicted disease-miRNA interactions using hypergeometric distribution on an integrated human phenome-microRNAome network. However, the efficacy of this method is limited in that it relies on predicted miRNA-target interactions, which may be inaccurate and incomplete. Xuan et al. 15 established a miRNA functional similarity network derived from known disease-miRNA relationships, disease similarity, miRNA clusters and family data. Then, they predicted potential miRNAs related to a given disease based on the weighted k most similar neighbors. Considering that the aforementioned methods only utilize local network association information for ranking the potential links, Chen et al. 16 developed a global network similarity model by implementing the random walk algorithm on a constructed miRNA-miRNA functional similarity network. Shi et al. 17 also modeled the disease-miRNA relationship prediction process as a random walk on a protein-protein interaction network, which calculated functional associations between disease-related genes and miRNA-targeted genes. Similarly, MIDP 18 extrapolated new disease-miRNA interactions based on random walk on the miRNA functional similarity network. This model assigned different transition matrices to known and unknown miRNAs in order to use the prior information known about these miRNAs. To implement prediction for new diseases, random walk was applied to a disease-miRNA bilayer network, namely, MIDPE.
Furthermore, researchers have recently integrated multiple similarities, including semantic similarities between diseases, functional similarities between miRNAs, and Gaussian interaction profile kernel similarities of miRNAs and diseases, to achieve better prediction performance. For example, Chen et al. 19 introduced a similarity search method named WBSMDA, based on the within-score and between-score of each candidate disease-miRNA pair, to predict novel disease-miRNA interactions. Subsequently, You et al. 20 presented the approach of path-based miRNA-disease association prediction (PBMDA) to mine latent links between diseases and miRNAs on the same types of biological datasets. In addition, machine learning methods have proved efficient in this field. Xu et al. 21 extracted four topological features from a constructed miRNA target-dysregulated network and fed these features into a support vector machine (SVM) to distinguish positive miRNAs associated with prostate cancer from negative ones. However, the performance of this approach is far from satisfactory because it is currently rather difficult, or even impossible, to select negative miRNA-disease association samples. To overcome this limitation, a semi-supervised model called RLSMDA, which did not need negative samples, was proposed by Chen et al. 22 This method is especially useful when applied to diseases with no known associations to any miRNA. By integrating known disease-miRNA interactions and the similarities of miRNAs and diseases, Luo et al. 23 proposed a novel computational model named KRLSM, which performed predictions on the entire disease-miRNA space by using Kronecker product algebraic properties. Recently, the RKNNMDA method 24 used the K-Nearest Neighbors algorithm to search for the k nearest neighbors of each miRNA and each disease from the similarity scores of miRNAs and diseases, and finally obtained the candidate associations according to an SVM Ranking model.
However, the performance of the above models remains unsatisfactory for sparse miRNA-disease association datasets.
Considering that known miRNA-disease pairs are rare in current datasets, we address the problem of association prediction on sparse known miRNA-disease interaction networks. In this study, we propose a Kronecker kernel matrix dimension reduction model, which combines the cosine similarity matrices of miRNAs and diseases into one miRNA-disease similarity matrix by using the Kronecker product or the Kronecker sum to identify latent relationships between diseases and miRNAs. We tested the predictive performance of this method on HMDD datasets. The experiments show that, in terms of AUC, reliable results were achieved for 22 diseases associated with at least 60 miRNAs. Additionally, we carried out case studies on kidney cancer, breast cancer, and esophageal cancer as a further evaluation. For these three important cancers, more than 90 percent of the top 50 miRNA candidates were verified by the published biological literature and by three public databases.

Data preparation
The known disease-miRNA interactions were obtained from the HMDD database (January 2014 Version). 25 After filtering out duplicate records, 5424 distinct, experimentally confirmed interactions were obtained, containing 378 diseases and 495 miRNAs. In addition, three other public databases (i.e., dbDEMC, 26 miRCancer, 27 and PhenomiR2.0 (ref. 28)) were used to confirm prediction results with case studies.

Problem formalization
We address the issue of identifying novel associations in a miRNA-disease bipartite network. Formally, $X_m = \{m_1, m_2, \ldots, m_{n_m}\}$ and $X_d = \{d_1, d_2, \ldots, d_{n_d}\}$ denote the sets of all miRNAs and all diseases in the network, respectively. The edge set of the network represents the known miRNA-disease pairs. We can store this network in an $n_d \times n_m$ adjacency matrix $A$, where $[A]_{ij}$ is equal to 1 if disease $d_i$ interacts with miRNA $m_j$, and is 0 otherwise. Therefore, the $i$-th row of $A$ is a binary vector that represents the association between disease $d_i$ and each miRNA, whereas the $j$-th column of $A$ represents the association between miRNA $m_j$ and each disease. We need to calculate the relevance likelihood of each non-interacting miRNA-disease pair and then infer novel associations among these pairs.
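The adjacency matrix described above can be illustrated with a minimal Python sketch. The function name `build_adjacency` and the toy identifiers are hypothetical and are not taken from the KMDR source code or from HMDD; duplicate records simply map to the same entry.

```python
import numpy as np

def build_adjacency(pairs, diseases, mirnas):
    """Return the n_d x n_m binary matrix A with A[i, j] = 1 for known pairs."""
    d_index = {d: i for i, d in enumerate(diseases)}
    m_index = {m: j for j, m in enumerate(mirnas)}
    A = np.zeros((len(diseases), len(mirnas)), dtype=int)
    for disease, mirna in pairs:
        # Setting the entry twice is harmless, so duplicates are absorbed here.
        A[d_index[disease], m_index[mirna]] = 1
    return A

# Toy data (hypothetical identifiers, for illustration only).
diseases = ["d1", "d2", "d3"]
mirnas = ["m1", "m2", "m3", "m4"]
pairs = [("d1", "m1"), ("d1", "m2"), ("d2", "m2"), ("d3", "m4"), ("d1", "m1")]
A = build_adjacency(pairs, diseases, mirnas)
```

Rows of `A` index diseases and columns index miRNAs, matching the $n_d \times n_m$ convention used in the text.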

Calculation of cosine similarities for diseases and miRNAs
Cosine similarities for diseases were computed under the assumption that diseases showing similar patterns of interaction and non-interaction with the miRNAs of a disease-miRNA association network tend to interact in a similar way with new miRNAs; a similar assumption was made for miRNAs. The binary vector $IP(d_i)$ represents the interaction pattern of disease $d_i$, which encodes the presence or absence of interaction with each miRNA (i.e., the $i$-th row of the adjacency matrix $A$). Therefore, the cosine similarity between diseases $d_i$ and $d_j$ can be computed as follows:
$$S_d(d_i, d_j) = \frac{IP(d_i) \cdot IP(d_j)}{\|IP(d_i)\| \, \|IP(d_j)\|}$$
After calculating the cosine value for each disease-disease pair, the disease similarity matrix $S_d$ was established.
Similarly, the miRNA cosine similarity matrix $S_m$ can be calculated as follows:
$$S_m(m_i, m_j) = \frac{IP(m_i) \cdot IP(m_j)}{\|IP(m_i)\| \, \|IP(m_j)\|}$$
In this equation, $IP(m_i)$ is the interaction pattern of miRNA $m_i$, which encodes the presence or absence of interaction with each disease (i.e., the $i$-th column of the adjacency matrix $A$).
There are other methods to calculate a similarity matrix from interaction profiles. For instance, Chen et al. 29,30 proposed using the Gaussian interaction profile (GIP) kernel. We conducted brief experiments with the GIP kernel, which indicate that the cosine similarity method consistently outperforms the GIP kernel-based method in terms of AUC for the 22 selected diseases. The detailed results are presented in ESI Fig. S1. †
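As a rough illustration of the cosine similarities in this section, the following Python sketch computes both similarity matrices from a toy adjacency matrix. The helper name `cosine_similarity_rows` is hypothetical, and assigning similarity 0 to an all-zero interaction profile is an assumption made here to avoid division by zero; the text does not state how such profiles are handled.

```python
import numpy as np

def cosine_similarity_rows(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    # Assumption: an all-zero profile gets similarity 0 with every profile.
    safe = np.where(norms > 0, norms, 1.0)
    Xn = X / safe[:, None]
    return Xn @ Xn.T

# Toy adjacency matrix: 3 diseases (rows) x 4 miRNAs (columns).
A = np.array([[1, 1, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 1]])
S_d = cosine_similarity_rows(A)      # disease similarity, rows of A
S_m = cosine_similarity_rows(A.T)    # miRNA similarity, columns of A
```

Because the similarities depend on the known associations, they would need to be recomputed whenever the training associations change.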

Constructing kernel matrices
We constructed kernel matrices based on the cosine similarity matrices $S_d$ and $S_m$. These similarity matrices are symmetric, but they may not always be positive semi-definite. To satisfy the positive semi-definite property, we applied a simple transformation by adding a small multiple of the identity matrix to their diagonals. We denote the resulting kernel matrices for diseases and miRNAs by $K_d$ and $K_m$, respectively. These two base kernels are independent of each other; therefore, combining them into a whole kernel that directly relates disease-miRNA pairs is a better alternative. We can construct such whole kernels via the Kronecker product kernel or the Kronecker sum kernel, namely,
$$K = K_d \otimes K_m \quad \text{or} \quad K = K_d \oplus K_m = K_d \otimes I_{n_m} + I_{n_d} \otimes K_m$$
where $I_{n_d}$ and $I_{n_m}$ are identity matrices of size $n_d$ and $n_m$.
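The kernel construction can be sketched in Python as follows. The function names are hypothetical, and the size of the diagonal shift (the most negative eigenvalue plus a small `eps`) is an assumption; the text only says that a small multiple of the identity matrix is added.

```python
import numpy as np

def to_kernel(S, eps=1e-6):
    """Shift the diagonal of a symmetric matrix so that it is positive definite."""
    S = 0.5 * (S + S.T)                      # guard against rounding asymmetry
    lam_min = np.linalg.eigvalsh(S).min()
    shift = max(0.0, -lam_min) + eps         # assumed choice of shift
    return S + shift * np.eye(S.shape[0])

def kronecker_product_kernel(K_d, K_m):
    return np.kron(K_d, K_m)

def kronecker_sum_kernel(K_d, K_m):
    n_d, n_m = K_d.shape[0], K_m.shape[0]
    return np.kron(K_d, np.eye(n_m)) + np.kron(np.eye(n_d), K_m)

# Toy symmetric matrices; the first one is indefinite before the shift.
K_d = to_kernel(np.array([[1.0, 2.0], [2.0, 1.0]]))
K_m = to_kernel(np.array([[1.0, 0.5, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0]]))
K_kp = kronecker_product_kernel(K_d, K_m)    # shape (6, 6)
K_ks = kronecker_sum_kernel(K_d, K_m)        # shape (6, 6)
```

Both combined kernels have $n_d n_m$ rows and columns, which is why forming them explicitly becomes expensive for realistic numbers of diseases and miRNAs.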

Kernel matrix dimension reduction model
Based on the assumption that two similar node pairs tend to have the same connection strength, the prediction score matrix $\hat{A}$ could be written as follows:
$$\mathrm{vec}(\hat{A}^T) = S \, \mathrm{vec}(A^T) \tag{3}$$
where $\mathrm{vec}(\cdot)$ is a vectorization function obtained by stacking the columns of a matrix into a vector. The entry $[\hat{A}]_{ij}$ represents a relevance score of a disease-miRNA pair $(d_i, m_j)$. $S$ could be considered a link similarity matrix. In this work, motivated by the report by Kuang et al., 31 we construct a link similarity matrix $S$ based on a modified kernel matrix dimension reduction method. Dimension reduction aims at projecting our training data into a feature space with a lower dimension, which has the role of pushing similar data together and pulling dissimilar data apart.
The construction of matrix $S$ based on kernel matrix $K$ is described below. Assume that kernel matrix $K$ is an $n \times n$ matrix. The eigen decomposition of $K$ is expressed as $K = V \Lambda V^T$, where $V = [v_1, v_2, \ldots, v_n]$ and $v_i$ is an eigenvector of $K$. $\Lambda$ is a diagonal matrix whose elements are $[\Lambda]_{ii} = \lambda_i$, where $\lambda_i$ is the eigenvalue associated with $v_i$. Therefore, according to linear algebra theory, we can obtain the eigen decomposition of $K$:
$$K = \sum_{i=1}^{n} \lambda_i v_i v_i^T$$
For further simplification, we assume that the eigenvalues of $K$ are sorted in non-increasing order (i.e., $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n$). Generally, larger eigenvalues are more important than smaller ones. Therefore, we only consider the top $p$ eigenvalues, and construct a link similarity matrix $S$ as follows:
$$S = \sum_{i=1}^{p} \lambda_i v_i v_i^T \tag{5}$$
Note that if $p$ is not very large, $\lambda_p$ is greater than 0; thus, the rank of the link similarity matrix $S$ is $p$, and the rank of the kernel matrix $K$ is not less than $p$. Hence, we call this method the kernel matrix dimension reduction method (KMDR). Finally, substituting eqn (5) into eqn (3), we obtained the general formula of KMDR as follows:
$$\mathrm{vec}(\hat{A}^T) = V \Lambda^* V^T \, \mathrm{vec}(A^T) \tag{6}$$
where $\Lambda^*$ is a diagonal matrix whose elements are $[\Lambda^*]_{ii} = \lambda$ ($i \in \{1, 2, \ldots, n\}$), where $\lambda$ is equal to $\lambda_i$ if $i \in \{1, 2, \ldots, p\}$, and is 0 otherwise.
Obviously, if we use a different kernel matrix, the final prediction score matrix by KMDR will also be different. Hence, based on the Kronecker product kernel and Kronecker sum kernel, KMDR could result in two independent sub-algorithms: KMDR-KP and KMDR-KS; KP and KS are short for Kronecker product and Kronecker sum, respectively. Fig. 1 illustrates the overall flowchart of the KMDR method.
Note that there is a slight difference between this model and the method described by Kuang et al. 31 We use the top $p$ eigenvalues to weight the symmetric matrices $v_i v_i^T$ ($i \in \{1, 2, \ldots, p\}$), while the method described by Kuang et al. uniformly uses a single constant and therefore may not be able to distinguish between the importance of different eigenvalues.
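The general KMDR formula can be illustrated with a small Python sketch that works on an explicitly formed kernel. The names `link_similarity` and `kmdr_predict` are hypothetical, and the sketch assumes the rows and columns of `K` are ordered disease-major, so that they match the row-wise flattening of $A$ (which equals $\mathrm{vec}(A^T)$).

```python
import numpy as np

def link_similarity(K, p):
    """S = sum of lambda_i * v_i v_i^T over the p largest eigenvalues of K."""
    lam, V = np.linalg.eigh(K)               # eigh returns ascending eigenvalues
    top = np.argsort(lam)[::-1][:p]          # indices of the p largest
    return (V[:, top] * lam[top]) @ V[:, top].T

def kmdr_predict(K, A, p):
    """vec(A_hat^T) = S vec(A^T), reshaped back to an n_d x n_m score matrix."""
    n_d, n_m = A.shape
    vec_At = A.reshape(-1)                   # rows of A stacked = vec(A^T)
    return (link_similarity(K, p) @ vec_At).reshape(n_d, n_m)

# Toy example: a random positive semi-definite kernel over 2 x 3 = 6 pairs.
rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
K = B @ B.T
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
scores = kmdr_predict(K, A, p=2)
```

This direct form needs the full $n \times n$ kernel in memory, so it is only practical for small examples.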

KMDR-KP
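In KMDR-KP, the Kronecker product $K_d \otimes K_m$ of the disease and miRNA kernels is
$$K((d_i, m_j), (d_k, m_l)) = K_d(d_i, d_k) \, K_m(m_j, m_l)$$
Hence, the size of the kernel matrix $K$ is $n_d n_m \times n_d n_m$, which would require a large memory overhead even for a moderate number of diseases and miRNAs. To reduce the computational cost, a more efficient formulation based on eigen decompositions is used, as performed in ref. 32. Let $K_d = V_d \Lambda_d V_d^T$ and $K_m = V_m \Lambda_m V_m^T$ be the eigen decompositions of the kernel matrices $K_d$ and $K_m$. As the eigenvectors (eigenvalues) of a Kronecker product are the Kronecker products of the eigenvectors (eigenvalues), we can rewrite the Kronecker product kernel as
$$K = K_d \otimes K_m = V \Lambda V^T, \quad V = V_d \otimes V_m, \quad \Lambda = \Lambda_d \otimes \Lambda_m$$
To efficiently multiply this kernel matrix with $\mathrm{vec}(A^T)$, we make use of an algebraic property of the Kronecker product, that is, $(B \otimes C) \, \mathrm{vec}(X) = \mathrm{vec}(C X B^T)$. After the conversion, the final prediction score matrix can be written as follows:
$$\hat{A}^T = V_m Z V_d^T, \quad \mathrm{vec}(Z) = \Lambda^* \, \mathrm{vec}(V_m^T A^T V_d) \tag{8}$$
Here the definition of $\Lambda^*$ is similar to that in eqn (6): the $p$ largest diagonal entries of $\Lambda$ are retained and the remaining entries are set to 0.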
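The eigen decomposition shortcut for KMDR-KP can be sketched in Python as below. The function name `kmdr_kp` is hypothetical and the code is an illustration of the algebra rather than the authors' implementation; it works with $\hat{A}$ directly (the transpose of the form $\hat{A}^T = V_m Z V_d^T$) and never forms the $n_d n_m \times n_d n_m$ kernel.

```python
import numpy as np

def kmdr_kp(K_d, K_m, A, p):
    """KMDR-KP scores using only the eigenpairs of K_d and K_m."""
    lam_d, V_d = np.linalg.eigh(K_d)
    lam_m, V_m = np.linalg.eigh(K_m)
    L = np.outer(lam_d, lam_m)               # L[i, j] = lam_d[i] * lam_m[j]
    # Keep the p largest product eigenvalues and zero out the rest.
    keep = np.argsort(L, axis=None)[::-1][:p]
    L_star = np.zeros(L.size)
    L_star[keep] = L.ravel()[keep]
    L_star = L_star.reshape(L.shape)
    # A_hat = V_d (L_star * (V_d^T A V_m)) V_m^T, with * taken elementwise.
    return V_d @ (L_star * (V_d.T @ A @ V_m)) @ V_m.T

# Toy example with random positive semi-definite kernels (illustrative only).
rng = np.random.default_rng(0)
Bd = rng.standard_normal((3, 3))
Bm = rng.standard_normal((4, 4))
K_d, K_m = Bd @ Bd.T, Bm @ Bm.T
A = (rng.random((3, 4)) < 0.5).astype(float)
scores = kmdr_kp(K_d, K_m, A, p=3)
```

The cost is dominated by two small eigen decompositions and a few $n_d \times n_m$ matrix products, instead of an eigen decomposition of the full pairwise kernel.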

KMDR-KS
In KMDR-KS, the Kronecker sum kernel is defined as $K = K_d \oplus K_m$. Similar to KMDR-KP, the final prediction score matrix of KMDR-KS has the same form as eqn (8). However, for the Kronecker sum kernel, $\Lambda = \Lambda_d \oplus \Lambda_m$. Therefore, the main difference between the two sub-algorithms is that they have different eigenvalue sets $\{\lambda_1, \lambda_2, \ldots, \lambda_p\}$, that is, the eigenvalue of $K$ associated with the pair of eigenvectors $(v_i^d, v_j^m)$ is
$$\lambda_{(i,j)} = \begin{cases} \lambda_i^d \, \lambda_j^m & \text{for KMDR-KP} \\ \lambda_i^d + \lambda_j^m & \text{for KMDR-KS} \end{cases}$$
where $\lambda_i^d$ and $\lambda_j^m$ are eigenvalues of $K_d$ and $K_m$, respectively.
There is a parameter $p$ in the construction of the link similarity matrix $S$. Here, we choose $p = [n \times q]$, where $n$ is the size of kernel matrix $K$, and $q \in [0, 1]$ is a proportion coefficient. The symbol $[\cdot]$ represents the Gauss rounding function. Notably, $q$ was set to 0.25 in all experiments, and 0.25 was also chosen as the optimal parameter $q$ in the method described by Kuang et al. 31 This is equivalent to projecting the data onto the subspace spanned by the top 25% principal components.
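A corresponding Python sketch for KMDR-KS, including the choice of $p$ from the proportion coefficient $q$, is given below. The function name `kmdr_ks` is hypothetical, and reading the Gauss rounding function as the floor function is an assumption.

```python
import numpy as np

def kmdr_ks(K_d, K_m, A, q=0.25):
    """KMDR-KS scores using only the eigenpairs of K_d and K_m."""
    lam_d, V_d = np.linalg.eigh(K_d)
    lam_m, V_m = np.linalg.eigh(K_m)
    L = np.add.outer(lam_d, lam_m)           # L[i, j] = lam_d[i] + lam_m[j]
    p = int(np.floor(L.size * q))            # p = [n x q], read as floor (assumption)
    # Keep the p largest summed eigenvalues and zero out the rest.
    keep = np.argsort(L, axis=None)[::-1][:p]
    L_star = np.zeros(L.size)
    L_star[keep] = L.ravel()[keep]
    L_star = L_star.reshape(L.shape)
    return V_d @ (L_star * (V_d.T @ A @ V_m)) @ V_m.T

# Toy example with random positive semi-definite kernels (illustrative only).
rng = np.random.default_rng(1)
Bd = rng.standard_normal((3, 3))
Bm = rng.standard_normal((4, 4))
K_d, K_m = Bd @ Bd.T, Bm @ Bm.T
A = (rng.random((3, 4)) < 0.5).astype(float)
scores = kmdr_ks(K_d, K_m, A, q=0.25)        # n = 12, so p = 3
```

Compared with the product case, only the eigenvalue grid changes, so the set of retained eigenvector pairs can differ between KMDR-KP and KMDR-KS.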

Performance evaluation
To evaluate the predictive capability of a method on a sparse set of known associations, we randomly divide all known associations of each disease into ten disjoint subsets, nine of which are used as testing samples while the remaining one is used as a training sample, over multiple iterations. As diseases associated with only a few miRNAs may be insufficient to assess the capacity of the prediction method, we selected 22 human diseases, which are associated with at least 60 miRNAs, as test cases. Since the cosine similarities for diseases and miRNAs are constructed on the basis of known disease-miRNA associations, we need to recalculate the cosine values for each run when the known associations change. The area under the ROC curve (AUC) was computed to assess the quality of the predicted associations. AUC = 1 indicates perfect classification, whereas AUC = 0.5 reflects random classification. Additionally, considering that there are few known disease-miRNA associations, we also adopted a precision-recall (PR) curve, and the area under the PR curve (AUPR) served as a complementary quality measure.
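The evaluation protocol can be illustrated with the following Python sketch, which splits the known associations of one disease into ten disjoint subsets and computes AUC from ranks. The function names are hypothetical, and the rank-based (Mann-Whitney) AUC with ties counted as one half is a standard estimator assumed here, not a detail stated in the text.

```python
import numpy as np

def split_known_associations(known_idx, n_folds=10, seed=0):
    """Shuffle the known miRNA indices of a disease and cut them into disjoint folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(known_idx), n_folds)

def auc_score(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# A disease with 60 known miRNAs (indices are placeholders).
folds = split_known_associations(np.arange(60))
train_idx = folds[0]                          # one subset is kept as known
test_idx = np.concatenate(folds[1:])          # the other nine are hidden and tested
```

In each run the hidden associations would be set to 0 in $A$, the similarities recomputed, and the hidden pairs scored against the unknown pairs.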
To demonstrate the effectiveness of the KMDR model, we compared its two sub-algorithms with six state-of-the-art models, namely, MIDP, 18 MIDPE, 18 RLSMDA, 22 WBSMDA, 19 KRLSM, 23 and RKNNMDA. 24 The parameters in MIDP, MIDPE, … Fig. 2 shows the comparison of the ROC curves from each method. Fig. 3 displays the PR curves and the average AUPR scores of the above eight methods. It is obvious that the PR curves of KMDR-KP and KMDR-KS lie above those of MIDP, MIDPE, RLSMDA, WBSMDA, KRLSM, and RKNNMDA. The average AUPR values achieved by KMDR-KS were 6.32%, 11.53%, 11.73%, 19.37%, 10.07%, 11.46%, and 12.65% higher than those of the other seven methods. These prediction results suggest that the KMDR model performs well with diseases that are associated with only a few known miRNAs. This might be attributed to the fact that KMDR successfully combines the spaces of diseases and miRNAs into a single disease-miRNA space by using the Kronecker sum. However, for two diseases, namely, "Heart Failure" and "Leukemia, Myeloid, Acute", MIDPE and WBSMDA achieve higher AUCs than KMDR-KS; this could be because our method only adopts the topological structure of the disease-miRNA bipartite network.

Case studies
Usually, the top-ranked associations are more important for each disease. The number of correctly identified known disease-miRNA interactions under different top selections is shown in Fig. 4. For example, among the 5424 known disease-miRNA interactions, KMDR correctly detected 3258 (or 60.07%) known associations in the top 50 predictions. The result shows the effectiveness of KMDR in identifying confirmed disease-miRNA interactions.
To further confirm the ability of KMDR to discover new miRNA-disease interactions, we present case studies of several important diseases (kidney neoplasms, breast neoplasms, and esophageal neoplasms). All known interactions included in the HMDD database are taken as the training set, and the non-interacting pairs of each disease are ranked according to the prediction scores. Predictive results were validated based on experimental literature and three recently updated disease-miRNA databases, namely, dbDEMC, 26 miRCancer, 27 and PhenomiR2.0. 28
Kidney cancer is a common urologic malignancy, and its incidence and death rates have been rising gradually. According to the report of the American Cancer Society in 2016, there would be approximately 62 700 new cases of kidney cancer, and 14 240 deaths, in America. 33 Recent biological experiments have shown that many miRNAs are related to kidney cancer. Here, we implemented KMDR-KS to identify candidate kidney neoplasm-associated miRNAs. As a result, using the dbDEMC and miRCancer databases, all of the top 50 miRNA candidates were identified as being associated with kidney cancer (see Table 2).
This journal is © The Royal Society of Chemistry 2018
For the top 5 predicted candidates, hsa-mir-155 and hsa-mir-126 were found to be up-regulated in renal cell carcinoma, 34,35 while hsa-mir-145, hsa-mir-200b, and hsa-mir-146a were identified as being down-regulated. 36,37 Notably, only 7 known miRNAs were associated with kidney neoplasms in our gold … 33
Previous studies have shown that multiple miRNAs have links with the progression of breast neoplasms. By implementing KMDR-KS to predict novel miRNA candidates associated with breast neoplasms, we confirmed that 45 out of the top 50 predicted miRNAs are present in dbDEMC, miRCancer, and PhenomiR2.0 (see Table 3). Furthermore, some potential candidates were validated by searching the literature on the PubMed website. Specifically, the expression of hsa-mir-378a (ranked 8th) increases during breast cancer formation. 38 Hsa-mir-542 (ranked 13th) has been identified as being significantly down-regulated in breast cancer cells. 39 In addition, hsa-mir-532 (ranked 26th) is markedly up-regulated in breast cancer tissues relative to normal tissues. 40
Esophageal cancer is the eighth most frequently diagnosed cancer worldwide, and it is considered the sixth leading cause of cancer-related death on account of its poor prognosis. Early detection and timely treatment of esophageal cancer are very helpful in improving a patient's chance of survival. In our standard association dataset, 74 known miRNAs are related to esophageal cancer. Among the top 50 predicted candidates ranked by KMDR-KS, 47 miRNAs are corroborated by the three aforementioned databases (see Table 4). Additionally, hsa-mir-200b (ranked 4th) was supported by experimental literature as being correlated with esophageal neoplasms. 41
The results of the case studies illustrate that KMDR-KS performs well in predicting potential disease-associated miRNAs.
Therefore, we further used KMDR-KS and KMDR-KP to rank potential candidates associated with each disease contained in HMDD (shown in ESI Tables S1 and S2 †), in the hope that these prediction results will be validated by future biological experiments.
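The ranking step used in the case studies, in which the non-interacting pairs of a disease are sorted by prediction score, can be sketched in Python as follows. The function name `rank_candidates`, the scores, and the miRNA names are placeholders for illustration and are not results from the paper.

```python
def rank_candidates(scores, known, names, top_k=50):
    """Rank miRNAs without a known link to the disease by descending score."""
    unknown = [(score, name)
               for score, flag, name in zip(scores, known, names)
               if not flag]                   # drop pairs already in the training set
    unknown.sort(key=lambda item: item[0], reverse=True)
    return [name for _, name in unknown[:top_k]]

# One row of the score matrix and of A for a single disease (placeholder values).
scores = [0.91, 0.15, 0.78, 0.40]
known = [1, 0, 0, 0]
names = ["mir-a", "mir-b", "mir-c", "mir-d"]
top = rank_candidates(scores, known, names, top_k=2)
```

The returned list could then be checked against external databases, as done with dbDEMC, miRCancer, and PhenomiR2.0 in the case studies.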

Discussion and conclusions
Identifying potential miRNA-disease associations could help discover novel biomarkers for clinical diagnosis, treatment, and prevention. Previous computational models still struggle to perform satisfactorily for diseases with only a few known associated miRNAs. Therefore, a Kronecker kernel matrix dimension reduction model (KMDR) was implemented to identify hidden miRNA-disease associations. KMDR combined the spaces of miRNAs and diseases into a whole miRNA-disease space by using the Kronecker product or the Kronecker sum. Compared with six existing computational methods, KMDR achieved higher AUC values in most selected diseases. Moreover, case studies on kidney cancer, breast cancer, and esophageal cancer were carried out, and 100%, 96% and 96% of the top 50 miRNA candidates for these three diseases, respectively, were verified by the literature and by databases. These results show that KMDR can reliably identify disease-miRNA associations for clinical and experimental validation.
The reliable performance of KMDR can be attributed to several factors. To begin with, our method combines the cosine similarity matrices of miRNAs and diseases into a larger miRNA-disease similarity matrix, which directly relates disease-miRNA pairs and could effectively improve the prediction performance. Second, negative miRNA-disease association samples are not needed in KMDR. Finally, KMDR is a global prediction model, which could be used to infer hidden miRNAs for all the diseases simultaneously. Despite the efficiency and practicability of KMDR, some limitations remain that need further research. First, like some other models, [42][43][44] KMDR only depends on the topological structure of the miRNA-disease network, which means it cannot predict associations for a disease that does not exist within the network. To solve this problem, additional biological information, such as miRNA functional similarity data and disease semantic similarity data, can be integrated to expand the application range of KMDR. Second, our similarity matrices for KMDR might not be optimal in some scenarios. Finally, as the currently known miRNA-disease associations are insufficient, more information about diseases and miRNAs can be used for constructing more reliable disease-similarity and miRNA-similarity matrices, which may improve prediction results. For example, we will integrate disease-gene interactions and miRNA-gene interactions in our future work.

Conflicts of interest
There are no conflicts to declare.