Open Access Article
Nicholas R. Ellin,a Yingchan Guo,a Ramón Alain Miranda-Quintana*ab and Boone M. Prentice*a
aDepartment of Chemistry, University of Florida, Gainesville, FL 32611-7200, USA
bQuantum Theory Project, University of Florida, Gainesville, FL 32611-7200, USA. E-mail: quintana@chem.ufl.edu; booneprentice@chem.ufl.edu
First published on 27th March 2024
Imaging mass spectrometry is a label-free imaging modality that allows for the spatial mapping of many compounds directly in tissues. In an imaging mass spectrometry experiment, a raster of the tissue surface produces a mass spectrum at each sampled x, y position, resulting in thousands of individual mass spectra, each comprising a pixel in the resulting ion images. However, efficient analysis of imaging mass spectrometry datasets can be challenging due to the hyperspectral characteristics of the data. Each spectrum contains several thousand unique compounds at discrete m/z values that result in unique ion images, which demands robust and efficient algorithms for searching, statistical analysis, and visualization. Some traditional post-processing techniques are fundamentally ill-equipped to dissect these types of data. For example, while principal component analysis (PCA) has long served as a useful tool for mining imaging mass spectrometry datasets to identify correlated analytes and biological regions of interest, the interpretation of the PCA scores and loadings can be non-trivial. The loadings often contain negative peaks in the PCA-derived pseudo-spectra, which are difficult to ascribe to underlying tissue biology. Herein, we have utilized extended similarity indices to streamline the interpretation of imaging mass spectrometry data. This novel workflow uses PCA as a pixel-selection method to parse out the most and least correlated pixels, which are then compared using the extended similarity indices. The extended similarity indices complement PCA by removing all non-physical artifacts and streamlining the interpretation of large volumes of imaging mass spectrometry spectra simultaneously. The linear complexity, O(N), of these indices suggests that large imaging mass spectrometry datasets can be analyzed in a 1:1 scale of time and space with respect to the size of the input data. The extended similarity indices algorithmic workflow is exemplified here by identifying discrete biological regions of mouse brain tissue.
Recent advancements improving acquisition time and throughput have resulted in substantially more spectra being acquired per imaging mass spectrometry experiment.5–7 Data files can contain upwards of one million spectra per image, with thousands of individual m/z values per spectrum, which can make data processing and analysis challenging and time-consuming for these high-dimensionality datasets.8 Additionally, many studies typically consist of multiple individual datasets (i.e., biological and technical replicates of multiple samples), further complicating data processing and analysis.9,10 Post-processing techniques such as factorization, clustering, and manifold learning are useful tools that have emerged to mine these datasets to identify biological regions of interest and better understand tissue biochemistry.11–14 For example, principal component analysis (PCA) has enabled the differentiation between different stages of tumor development in cancerous tissue and aided in diagnosing disease stage.15 Although these techniques have proven useful in elucidating molecular pathology and biochemistry in tissues, each approach comes with its own challenges and limitations.16–19 For example, PCA calculates the scores and loadings through linear combinations of the mean centered data. The scores are represented as spatial-expression images and the loadings are represented as pseudo-spectra. By linearly combining the m/z bins, the scores and loadings often result in negative values or peaks. Since negative peaks in mass spectrometry have no physical basis (i.e., they would represent negative ion abundances), it can be difficult to ascribe true physical meaning to PCA results. This calls for a computational method that can examine PCA results of imaging mass spectrometry data using only physical data, such as the extended similarity indices.
Similarity measures have been applied throughout many different fields of study to enable efficient comparisons of data.20 In chemistry, similarity measures have been used to compare molecules by representing specific features of their two-dimensional (2D) or three-dimensional (3D) structures as binary fingerprints. These comparisons are conducted to screen a large number of structures in virtual databases to identify molecules that may have similar properties to a reference molecule.21 For example, Lavecchia et al. used similarity searching based on the Tanimoto similarity coefficient to discover six ligands, similar to 4-(2-carboxybenzoyl)phthalic acid, in the NCI database that inhibited the cell division cycle 25B (Cdc25B) protein.22 Similarity measures have also been used for compound annotation of experimental tandem mass spectrometry (MS/MS) data.23,24 MS/MS similarity calculations perform database searches by comparing experimental spectra of an unknown compound to spectra within libraries of known compounds. When analyzing complex mixtures using an untargeted technique such as liquid chromatography coupled to tandem mass spectrometry (LC-MS/MS), identifying each of the hundreds or thousands of discrete compounds in the sample can be cumbersome. Similarity matching makes it much more efficient to match the mass spectral profiles of unknown compounds to spectral libraries and thereby facilitate identification. However, most similarity measures only compare two objects at a time, typically a reference and a test, making these measures slow and poorly scalable. Recently, we have introduced new similarity measures, called extended similarity indices, which compare multiple objects simultaneously.25 Instead of pairwise comparisons common to traditional similarity measures, the extended similarity indices compare an arbitrary number of objects to each other simultaneously (i.e., they are n-ary functions). 
This has opened the door for new analysis techniques such as diversity picking, the study of large molecular libraries, chemical space visualization, clustering, and protein structure determination.26–30 The extended similarity indices provide two key advantages: they allow quantitation of the correlations between any number of objects and they can be performed with unprecedented efficiency, requiring only O(N) scaling.
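To make the contrast with pairwise measures concrete, the following minimal sketch computes the conventional Tanimoto coefficient for two binary fingerprints and then averages it over every pair in a small set. The fingerprints and function names are illustrative assumptions, not data or code from this work; the point is only that the number of pairs grows as N(N − 1)/2.

```python
from itertools import combinations

import numpy as np


def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
    """Pairwise Tanimoto coefficient: shared 1 bits / bits set in either."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # convention assumed here for two all-zero fingerprints
    return float(np.logical_and(a, b).sum() / union)


def mean_pairwise_tanimoto(fps: np.ndarray) -> float:
    """Average over all N(N-1)/2 pairs, i.e. quadratic cost in N."""
    pairs = combinations(range(len(fps)), 2)
    return float(np.mean([tanimoto(fps[i], fps[j]) for i, j in pairs]))


# Hypothetical toy fingerprints (3 objects x 5 bits).
fingerprints = np.array([
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 0, 0],
])

pair_value = tanimoto(fingerprints[0], fingerprints[1])  # 2 shared / 3 set
mean_value = mean_pairwise_tanimoto(fingerprints)
```

For three objects there are only three pairs, but for the tens of thousands of spectra in an imaging dataset the pair count becomes prohibitive, which is the motivation for n-ary indices.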
Herein, we present a novel post-processing workflow that utilizes extended similarity-based algorithms to compare multiple mass spectra within an imaging mass spectrometry dataset. The utility of the extended similarity indices is demonstrated by comparing multiple PCA-correlated mass spectra from imaging datasets to distinguish morphological tissue regions. PCA correlated spectra within these morphological regions are expected to have more similar spectral content than non-correlated spectra from different regions. Using this proof-of-concept workflow, the extended similarity indices have shown that spectra with stronger PCA correlations also had greater similarity coefficients when they occupied morphological tissue regions. By applying the extended similarity indices, we can efficiently determine if the PCA correlated spectra truly represent physical regions of tissue through similar spectral content.
000 at m/z 760. 98% data reduction was performed during acquisition to reduce the overall file size. A 75 μm SmartWalk setting and 75 μm raster step size were used, resulting in 15,842 pixels (spectra) with a file size of 17.7 GB. Rat kidney images were acquired using the same instrument and method (see ESI†).
427 pixel rat brain dataset (see ESI†). This dataset had a file size of 126 GB and required a computing time of 16.03 minutes to calculate the extended similarity index of the 30 selected-pixel lists. This demonstrates a proof-of-principle of our extended similarity indices to efficiently manage the interpretation of large volumes of imaging mass spectrometry data.
A conversion to binary fingerprints is used to simplify the comparison framework for this proof-of-concept experiment, enabling calculation of the similarity from the number of coinciding 1 bits. Future work will focus on the use of normalized real values to represent spectra rather than binary fingerprints. This conversion was performed by first extracting the raw intensity values of the imaging dataset using SCiLS Lab software (Bruker Daltonics, Billerica, MA) as a single 2D matrix of size m × n, where m is the number of pixels or individual spectra in the image and n is the number of m/z values (or m/z bins). Typical values of m range from 1000 to 10,000 and typical values of n range from 250,000 to 500,000. However, these values can vary depending on acquisition parameters. The raw intensities are normalized on a 0–1 scale using one of four normalization methods: local, global, localTIC, or globalTIC (Fig. 1D). Local normalization is calculated by dividing the intensity of each m/z bin within a spectrum by the maximum intensity in that spectrum. This process is repeated for all spectra in the image, with each spectrum being normalized to its own maximum intensity. Global normalization is calculated by dividing the intensity of each m/z bin within a spectrum by the maximum intensity in the entire dataset. LocalTIC normalization, or local total ion current normalization, is calculated by dividing the intensity of each m/z bin within a spectrum by the sum of all the intensities within that spectrum, and this process is repeated for each spectrum in the image. GlobalTIC normalization, or global total ion current normalization, is calculated by dividing the intensity of each m/z bin within a spectrum by the largest single-pixel total ion current in the dataset. Once the spectra were normalized, an intensity threshold was defined (Fig. 1D). The intensity threshold is a user-defined value between 0 and 1 that allows conversion to a binary format. If the normalized intensity of a peak is greater than the threshold, then it is assigned as a “1”. If the normalized intensity of a peak is less than or equal to the threshold, then it is assigned as a “0”. The result is a 2D data matrix of 0s and 1s with m rows of spectra and n columns of m/z bins or bits (Fig. 1E).
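The four normalization options and the thresholding step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the code used in this work: the function names, the toy intensity matrix, and the guard against all-zero spectra are hypothetical additions.

```python
import numpy as np


def normalize(intensities: np.ndarray, method: str = "local") -> np.ndarray:
    """Scale an (m pixels x n m/z bins) intensity matrix onto a 0-1 range."""
    x = np.asarray(intensities, dtype=float)
    if method == "local":        # each spectrum / its own maximum intensity
        denom = x.max(axis=1, keepdims=True)
    elif method == "global":     # every spectrum / dataset-wide maximum
        denom = x.max()
    elif method == "localTIC":   # each spectrum / its own total ion current
        denom = x.sum(axis=1, keepdims=True)
    elif method == "globalTIC":  # every spectrum / largest single-pixel TIC
        denom = x.sum(axis=1).max()
    else:
        raise ValueError(f"unknown normalization method: {method}")
    # Assumed guard: leave empty (all-zero) spectra as zeros instead of NaN.
    denom = np.where(denom == 0, 1.0, denom)
    return x / denom


def binarize(normalized: np.ndarray, threshold: float) -> np.ndarray:
    """Assign 1 where intensity > threshold, and 0 where it is <= threshold."""
    return (np.asarray(normalized) > threshold).astype(np.uint8)


# Hypothetical raw intensities: 2 pixels x 3 m/z bins.
raw = np.array([[0.0, 50.0, 100.0],
                [10.0, 200.0, 5.0]])
fingerprints = binarize(normalize(raw, "local"), threshold=0.1)
```

Note that the comparison is strict, so a bin whose normalized intensity equals the threshold is assigned 0, matching the rule stated above.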
Once the spectra have been selected based on the PCA score values, normalized on a 0–1 scale, and converted to binary fingerprints, the Russell–Rao (RR) extended similarity index is calculated similar to our previous report.25 Briefly, for each group, all the selected binary fingerprints of the spectra are aligned into a 2D data matrix and summed together column wise. If the sum of the column is above the coincidence threshold, then it is counted as a similarity between the binary fingerprints (or spectra). A weight function is also applied to allow the columns with more coincident 1s to contribute more to the final similarity coefficient. Herein, similarity is calculated across the entire range of coincidence thresholds in 5% increments for each region within each PC.
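The column-sum procedure described above can be sketched as follows. This is a simplified illustration and not the published implementation: the coincidence threshold is assumed to be a fraction of the number of spectra, and the weight of a column is assumed to be its fraction of 1s; the actual weight function and threshold convention are defined in the earlier report.25

```python
import numpy as np


def extended_russell_rao(fps: np.ndarray, coincidence: float = 0.5) -> float:
    """Sketch of a weighted n-ary Russell-Rao index over N binary fingerprints.

    coincidence: fraction of the N spectra that must share a 1 in a column
    before that column counts as a similarity (assumed convention).
    """
    fps = np.asarray(fps)
    n_spectra, n_bins = fps.shape
    col_sums = fps.sum(axis=0)                     # single pass over the data
    similar = col_sums > coincidence * n_spectra   # columns above the threshold
    weights = col_sums / n_spectra                 # assumed weight: fraction of 1s
    return float((weights * similar).sum() / n_bins)


def sweep_coincidence(fps: np.ndarray) -> dict:
    """Evaluate the index at coincidence thresholds 0%, 5%, ..., 95%."""
    thresholds = np.linspace(0.0, 0.95, 20)
    return {round(float(t), 2): extended_russell_rao(fps, t) for t in thresholds}


# Hypothetical group of 4 binary spectra x 5 m/z bins.
group = np.array([
    [1, 1, 0, 1, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 0, 0],
])
value = extended_russell_rao(group, coincidence=0.5)
profile = sweep_coincidence(group)
```

Because only the column sums are needed, the cost grows linearly with the number of spectra, in contrast to the quadratic cost of averaging pairwise coefficients.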
![]() | (1) |
![]() | (2) |
![]() | (3) |
![]() | (4) |
![]() | (5) |
The weighted PC function (EwPC) will place more significance on the E-index values from the first PCs (i.e., PC 1 is weighted more than PC 2 and so forth), the squared sum function (Ewsq) will place more significance on the larger base values, and the fraction function (Ewf) evenly weighs the base values for each PC based on the total number of PCs calculated. The variable n represents the total number of base values calculated for each combination of parameters, typically the number of PCs used to calculate the PCA. For example, if five base values were calculated, n would be equal to five and p would have values from 1–5. The variable εp is the resulting value from the base function for every p principal component. To determine which combination of functions were used, some simple notation will be established. The first subscript letter after “E” will refer to which base function was used, robust or maximum, with an r or m, respectively, followed by the notations for the weight function applied, wPC, wsq, or wf. For example, if the robust base function was used with a squared sum weight function, the proper notation will be Er_wsq. After running the similarity calculation for multiple intensity thresholds with selected pixel percentage values from 0–30%, the combination of base and weight functions that results in the largest E-index value should provide the best estimate for users to find the optimal set of parameters.
log N).
The medoid calculation uses the same extended similarity indices previously discussed but is iteratively applied to screen each score group to find the medoid spectrum or spectra. This is performed by removing a spectrum within a scores group and calculating the similarity coefficient of the remaining spectra without the removed spectrum (i.e., calculating the “complementary” similarity). The removed spectrum is then returned to the dataset, a second different spectrum is removed, and the similarity is recalculated without the other spectrum. This process is repeated for every spectrum in the dataset, with the key advantage that it can be performed in O(N). Once every spectrum has been iteratively removed and the similarity calculated, the iteration that resulted in the smallest similarity coefficient is identified as the medoid mass spectrum of the dataset.
The removed spectrum with the smallest complementary similarity is the medoid spectrum because it contributes most to the similarity of the spectra within the region. Since removing this spectrum resulted in the lowest similarity coefficient, it must contain the greatest number of 1s that coincide with other spectra. It should be noted that this does not mean this spectrum has the greatest number of 1s in its binary fingerprint. A spectrum could exist within the dataset that contains more 1s, but where none of these 1s are shared with bins in other spectra. The medoid spectra will contain the greatest number of 1s that are shared across the spectra in the dataset.
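A minimal sketch of the complementary-similarity search is given below. It reuses the same simplifying assumptions as a fraction-weighted Russell–Rao index (threshold as a fraction of the group size, weight equal to the fraction of 1s in a column), so the helper names and the toy data are hypothetical. The key step is that the column sums are computed once and each spectrum is "removed" by a subtraction, which keeps the whole search linear in the number of spectra.

```python
import numpy as np


def index_from_sums(col_sums: np.ndarray, n_spectra: int,
                    coincidence: float = 0.5) -> float:
    """Assumed weighted Russell-Rao index from precomputed column sums."""
    col_sums = np.asarray(col_sums, dtype=float)
    similar = col_sums > coincidence * n_spectra
    return float((col_sums / n_spectra * similar).sum() / col_sums.size)


def find_medoids(fps: np.ndarray, coincidence: float = 0.5):
    """Return indices of the medoid spectra and all complementary similarities."""
    fps = np.asarray(fps)
    n_spectra = fps.shape[0]
    total = fps.sum(axis=0)  # computed once for the whole group
    comp = np.array([
        # Similarity of the group without this spectrum.
        index_from_sums(total - row, n_spectra - 1, coincidence)
        for row in fps
    ])
    # Ties are kept, so more than one medoid can be returned.
    medoids = np.flatnonzero(np.isclose(comp, comp.min()))
    return medoids, comp


# Hypothetical group of 4 binary spectra x 5 m/z bins.
group = np.array([
    [1, 1, 0, 1, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 0, 0],
])
medoids, comp = find_medoids(group)
```

In this toy group the first and last spectra both contain three 1s, but only the first is the medoid, because one of the 1s in the last spectrum is not shared with any other spectrum.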
842 pixels, and each consists of 313,898 individual m/z bins over a m/z 401–1000 mass range. Root mean square (RMS) normalization was performed on the dataset and then 22 common lipid ions were selected for PCA using five principal components. PCA aims to reduce the dimensionality of the data by explaining as much of the variance of the dataset as possible within each PC as linear combinations of m/z bins. This results in less variance explained with each additional PC. Therefore, earlier PCs contain the bulk of the data's variance, ∼95%, while later PCs eventually will only contain the variance of the noise within the dataset (Fig. 2). For this reason, PCA was calculated using only the first five principal components. The first five principal components explain 45.8229%, 34.7466%, 8.04978%, 4.51346%, and 2.24242% of the variability of the data, respectively, with a cumulative sum of 95.38% (Fig. 2). The spatial expression images of the PCA scores successfully differentiate multiple biological regions across all principal components (Fig. 3). Specifically, PC 1 shows separation of the cerebral cortex from the white matter of the cerebellum, midbrain, and corpus callosum (Fig. 3A). Subregions of the hippocampus have been observed in PCs 1, 3, and 5, notably Ammon's horn and the dentate gyrus (Fig. 3A, C, and E). PC 2 highlights variations in lipid signal at the tissue periphery that are likely due to changes in tissue density, heterogeneous matrix crystallization, and/or analyte delocalization, and this PC is not biologically relevant (Fig. 3B). PC 3 has the cerebellum as the major region contributing to the variability (Fig. 3C). PC 4 has the granular layer of the cerebellum, a portion of the cerebral cortex, the inferior colliculus (a sub-region of the midbrain), and the choroid plexus (Fig. 3D). PC 5 shows many of the same regions as earlier PCs, including Ammon's horn, the dentate gyrus, the choroid plexus, and the granular layer (Fig. 3E). 
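As a quick arithmetic check, the cumulative variance quoted above follows directly from the five reported percentages. The helper that derives such percentages from a data matrix is a generic sketch using a singular value decomposition of mean-centered data; it is an assumption for illustration and not necessarily how the PCA was computed in the software used here.

```python
import numpy as np

# Percent variance explained by PCs 1-5, as reported in the text.
explained = np.array([45.8229, 34.7466, 8.04978, 4.51346, 2.24242])
cumulative = np.cumsum(explained)  # last entry is about 95.38


def explained_variance_percent(x: np.ndarray) -> np.ndarray:
    """Percent variance per PC from mean-centered data (generic SVD sketch)."""
    centered = x - x.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2  # proportional to variance along each PC
    return 100.0 * variances / variances.sum()


# Hypothetical random matrix standing in for pixels x selected ions.
rng = np.random.default_rng(0)
demo = explained_variance_percent(rng.normal(size=(50, 5)))
```

The percentages returned by the helper are sorted in decreasing order, which is why each additional PC explains less variance than the one before it.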
As PCA reduces dimensionality and combines significant structures, the pixels that contribute most to the same structures are also the most correlated (i.e., the low and high groups). The pixels that contribute the least to the structures present within each PC are therefore less correlated (i.e., the mid groups). The pseudo-spectra of the five principal components show the relative contribution of each ion to the variance explained by the PCs (Fig. 4). Ions with the same sign loading are positively correlated and loadings with opposite signs are negatively correlated. A larger number of ions significantly contribute to the variance in later principal components. As a result, no single ion contributes significantly more than the others after PC 2 (Fig. 4). As the spatial-expression image of PC 2 highlights both biological and non-biological regions (Fig. 3B), its pseudo-spectrum correlates all lipid ions together (i.e., these loadings are negative and correspond to negative scores in pixels, which are represented by the dark blue regions in the image). The only slightly positive loading (0.0109) in PC 2 is m/z 703.580 (Fig. 4B), meaning this ion is very weakly expressed in the spectra that make up the positive scores (i.e., the bright yellow region occupying the space just outside the tissue perimeter on the microscope slide; Fig. 3B). This small positive correlation is likely due to a small amount of biomolecule delocalization outside of the tissue that can occur during sample preparation. Additionally, this lipid ion is less abundant than the other lipids identified here (e.g., an average arbitrary intensity of 10⁵ compared to 10⁶ and higher for all other lipids analyzed).
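The grouping of pixels into low, mid, and high score groups can be sketched as below. The exact selection rule is an assumption made for illustration: the low and high groups are taken as the chosen percentage of pixels with the most negative and most positive scores, and the mid group as the same number of pixels with scores closest to zero. The function name and the toy scores are hypothetical.

```python
import numpy as np


def select_score_groups(scores: np.ndarray, percent: float) -> dict:
    """Pick low, mid, and high pixel groups from one PC's score vector."""
    scores = np.asarray(scores, dtype=float)
    k = max(1, int(round(scores.size * percent / 100)))
    ascending = np.argsort(scores)             # most negative scores first
    nearest_zero = np.argsort(np.abs(scores))  # smallest magnitudes first
    return {
        "low": ascending[:k],    # most negative scores
        "mid": nearest_zero[:k], # scores closest to zero
        "high": ascending[-k:],  # most positive scores
    }


# Hypothetical PC scores for 10 pixels; 20% selects 2 pixels per group.
pc_scores = np.array([-5.0, -0.1, 0.2, 3.0, -2.0, 0.05, 4.0, -0.3, 1.0, -4.0])
groups = select_score_groups(pc_scores, percent=20)
```

The returned indices can then be used to pull the corresponding rows out of the binary fingerprint matrix before the similarity of each group is calculated.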
Fig. 4 PCA pseudo-spectra of the first five PCs. Pseudo-spectra are shown for (A) PC 1, (B) PC 2, (C) PC 3, (D) PC 4, and (E) PC 5. Loadings of the same sign correspond to greater positive correlation within the PC and loadings of opposite signs correspond to greater negative correlation within the PC. Twenty-two m/z values are contained in each pseudo-spectrum: 524.381, 703.580, 731.611, 732.558, 734.574, 756.557, 758.575, 760.588, 769.565, 772.529, 782.572, 786.605, 788.620, 798.543, 806.573, 810.604, 826.577, 832.586, 834.604, 844.528, 848.562, and 872.559. Since the peaks chosen correspond to biological regions of the mouse brain, (B) is nearly all negative because they are oppositely correlated to the non-biological matrix clusters (Fig. 3B).
In the mouse brain lipid dataset reported here, the first 1% of selected pixels always gave the largest E-index value for a particular intensity threshold, regardless of which combination of functions was used (Fig. 5). In general, as the number of selected pixels for comparison increases, the E-index decreases, indicating that the difference in similarity between the score groups also decreases. The decrease is expected because as the number of pixels selected for comparison increases, their correlation through PCA weakens and thus so should spectral similarity. For some of the E-index functions, the values stabilize, causing somewhat of a plateau around 2–10% selected pixels, where the color in the plots is relatively consistent (Fig. 5A–C). The plateaus could indicate a point where the spectra being added to the group equally express the correlations from PCA and thus the differences in similarity between the score groups (i.e., the E-index) are more consistent. Some plots show an increase in E-index values as the selected pixels increase, resulting in a peak at around 9% selected pixels (Fig. 5B and E). When the E-index increases alongside the selected pixel percentage, the mid scores group decreases in similarity while the low/high groups retain or even increase similarity as more pixels are added. The peaks seen in the Em_wsq and Er_wsq plots (Fig. 5B and E) along with the plateaus in the Em_wPC and Em_wf plots (Fig. 5A and C) suggest the optimal selected pixel percentage is within the range of 1–15%.
Fig. 5 E-Index plots. Two base functions (eqn (1) and (2)) and three weight functions (eqn (3)–(5)) were tested for relative comparison. (A)–(C) were calculated with the maximum base function, εm, and weighted functions: weighted PC, weighted square, and weighted fraction, respectively. (D)–(F) were calculated with the robust base function, εr, and weighted functions: weighted PC, weighted square, and weighted fraction, respectively. Across all the methods of calculating the E-index, 1% selected pixels was found to have the largest E-index values. The optimal parameters are estimated to be within the range of 1–10% for the selected pixels and 0.09–0.19 for the intensity threshold. PC 2 was omitted from all calculations since it is highly correlated with non-biological matrix clusters.
For all pixel percentages, as the intensity threshold increases so does the E-index until about an intensity threshold of 0.09 (Fig. 5). After an intensity threshold of 0.09, the E-index values either remain relatively constant (Fig. 5A, C, D, and F) or quickly decrease and increase again, creating two peaks per selected pixel percentage (Fig. 5B and E). As the intensity threshold increases, fewer m/z bins are assigned 1s in the binary fingerprints, and more are assigned 0s. The more correlated spectra from PCA will retain more of the 1s as the intensity threshold increases compared to the less correlated spectra. Since the E-index evaluates the difference in similarity between the low/high groups and the mid group, if the mid group decreases in similarity more than the low/high groups the E-index will increase. The point where the E-index plateaus is the point where increasing the intensity threshold results in the same decrease in 1s for all groups. With these trends in mind, the E-index plots point to the optimal intensity threshold being within the range of 0.09–0.19 (Fig. 5).
The E-index serves to guide users toward the optimal set of parameters, so it is important to remember that the results provided here are not definitive. Each imaging mass spectrometry experiment will have its own range of optimal values and it is up to the user to determine them. Final determination of optimal parameters will be discussed in the following section.
The similarity coefficients for each score group were averaged and plotted together to help compare and interpret them in a 2-dimensional plot (Fig. 7). In this 2-dimensional plot, the differences in similarity of the three regions for each PC can be visualized. However, the similarity of the high score group of PC 2 is much greater than all the other PCs making interpretation difficult. Upon removal of PC 2, the characteristic “V” shape is observed with the remaining principal components (Fig. 7). Looking at the spatial distribution of the score groups for each PC, the low and high groups all occupy multiple biological regions of the mouse brain tissue: cerebral cortex, white matter, gray matter, corpus callosum, hippocampus, choroid plexus, and midbrain (Fig. 8). While the mid groups for every PC show no discernible structures (Fig. 8), the unique spatial distribution patterns for all the score groups confirm the extended similarity indices' ability to discern biological regions of PCA correlated spectra. The strongly correlated low and high groups occupy biological regions of tissue and have greater spectral similarity compared to their respective weakly correlated mid groups that do not occupy any biologically distinct structures (Fig. 7 and 8). By applying the extended similarity indices to the PCA of imaging mass spectrometry data, the correlated spectra can be efficiently connected to biological regions of tissue.
For intensity thresholds of 0.10, nearly all groups from every PC had more than one medoid spectrum. Within each group the binary fingerprints of the medoids were all nearly identical, with only a few lipids that vary between the spectra. The multiple medoid spectra that resulted for each group indicate that each spectrum contributes equally to the overall similarity of the region and equally represents the group. Ideally there should only be one medoid spectrum per group, but due to the low resolution of the binary fingerprint representation used here, more spectra can potentially have the 1s needed to be counted as a medoid. For example, each group had a unique set of lipids that were always present in the medoid's binary fingerprint. These consistently present lipids are the most common in the group and are thus what the medoid spectra represent. As long as a spectrum contains all of these lipids in the binary fingerprint, it can be considered a medoid. For the lipids that vary between each medoid spectrum, they are not common enough within the group to count as a similarity. If one of the variable lipids within a medoid's binary fingerprint did count as a similarity, then it would have resulted in a smaller complementary similarity and thus would need to be present in all medoid spectra.
Many of the loadings from PCA are properly correlated with their respective lipids in the binary fingerprints, such as those seen in PC 1 (Fig. 9A, C, and D). Although highly abundant lipids are easily represented in the binary fingerprints, the lipids that are of particular interest are those that are correlated with the PCA scores and loadings. The lipids at m/z 786.605 and 826.577 are correlated with the positive loadings of PC 1 and are unique to the binary fingerprint of the high group medoid (Fig. 8A, 9C, and D). Similarly, four lipids (m/z 731.611, 769.565, 772.529, and 810.604) are correlated with the negative loadings and the low group binary medoid (Fig. 8A, 9A, and D). The mid group binary medoid has a mix of lipids that are correlated to both the positive and negative loadings. However, m/z 806.573 has a negative loading but is unique to the mid region medoid binary fingerprint (Fig. 8A, 9B, and D). Since PC scores are tied to the loadings, pixels with negative scores (i.e., the low group), should be directly correlated with the lipids that have negative loadings and vice versa (e.g., m/z 810.604 and 826.577, respectively) (Fig. 9A, C, and D).
In the remaining principal components, 13 of the 22 lipids used for PCA were properly correlated to the medoids of at least one scores group and its corresponding loading when an intensity threshold of 0.10 was chosen (see ESI†). This directly ties the extended similarity indices' ability to help interpret PCA results by providing accurate real ion intensities for reference when analyzing the loadings and providing a real mass spectrum to represent biological regions highlighted with the score groups. Real ion intensities allow easier comparison of lipid abundances in different tissue regions (i.e., a lipid that is lowly abundant throughout the entire tissue, but exhibits strong correlations, or higher loading, within one specific tissue region with other more abundant lipids). It is important to note that the mid medoid may also present the same lipids that are expected to be unique to the high and low groups, such as m/z 731.611 (Fig. 9A and B). Although this situation might appear to indicate that the lipid is not unique to a specified group, the mid group is composed of pixels with both positive and negative score values closest to zero. Therefore, the mid group will exhibit the correlations from both low and high groups, but to a lesser extent since the magnitude of the scores is the strength in which a loading is expressed and the mid group is composed of the smallest score values. Unexpectedly, the mid group medoid could also express lipids that are absent in both the low and high groups. For example, m/z 806.573 is only present in the binary fingerprint of the mid group's medoid with a loading of −0.023 (Fig. 9A, B, and D). The expression of m/z 806.573 only in the mid group medoid is unexpected because scores are the strength in which a particular loading is expressed, so the low group should most strongly express all the negative loadings for that PC since it is composed of the most negative score values. 
Calculation of the medoid using the extended similarity indices reveals that m/z 806.573 does not follow the expected trend from the PCA results, further demonstrating the utility of this method in efficiently analyzing PCA data and the spatial distributions of analytes. Lipids that are present in all the medoids and binary fingerprints exist exclusively above the intensity threshold for nearly all pixels.
The medoid serves as an accurate representation of the loadings and the score groups, and thus the medoid calculation should be based on the lipids selected for PCA. For the extended similarity indices to be based on the lipids selected for PCA, the intensity threshold must be set so that the intensities of the selected lipids fall both above and below, or vary close to, this value. In the binary medoids with the intensity threshold set to 0.10, the properly represented loadings (m/z values 731.611, 769.565, 772.529, 786.605, 810.604, and 826.577) all have ion intensities that vary around the threshold value (Fig. 9A, C, and D). The medoids calculated with the intensity threshold set to 0.01 have far fewer PCA-correlated lipids that vary around this threshold value (see ESI†), meaning that an intensity threshold of 0.01 is not in the range of variance for the PCA-correlated lipids. When the intensity threshold is correctly set within the range of variance for the PCA-correlated lipids, the extended similarity indices can accurately calculate a medoid to represent the correlated spectra. To strengthen the robustness of our results, the extended similarity indices methods were also applied to a rat kidney dataset, demonstrating consistent outcomes comparable to those observed in the mouse brain dataset (see ESI†).
It is important to note that the use of extended similarity indices is not an alternative to PCA, but a complement to it. More generally, our method can be used in conjunction with other pixel selection schemes, like clustering or other forms of matrix factorization, since it is aimed at providing a fast and robust estimate of the correlation of selected pixels, while also providing local information through the use of the complementary similarity measures (e.g., the medoid algorithm described in the main text). These are attractive characteristics when compared with other alternatives to analyse imaging mass spectrometry data. For instance, methods like t-SNE and UMAP could help identify correlated regions in the tissues, but they rely on an approximated (dimensionally-reduced) representation of the data that inevitably loses information with respect to the originally recorded pixels. Other methods, like non-negative matrix factorization could be seen as alternatives to the negative PCA loadings, but performing this factorization exactly is an NP-hard problem, and approximated algorithms have a worse computational scaling than PCA, so it is more efficient to couple the extended similarity analysis with the PCA results.
Future work will focus on representing spectral intensities with more precise resolution and moving away from PCA reliance by applying the extended similarity indices to other computational algorithms, such as k-means clustering. In order to increase the resolution of spectral intensities, the extended continuous similarity indices will be used to calculate the similarity of the spectra by representing each ion intensity with decimal values instead of binary values.33 The decimal values will offer more accurate representations of the ion intensities, while still retaining the efficiency and physical basis of the binary comparisons. Moving away from PCA reliance will enable the extended similarity indices to operate as an independent machine learning algorithm for imaging mass spectrometry data. The extended similarity indices' reliance on PCA stems from its current inability to efficiently select pixels for comparison independently. By using the extended similarity indices to develop new clustering algorithms (such as novel flavors of k-means or density clustering), pixels can be grouped together based on spectral similarity from the beginning.12 These future applications of extended similarity algorithms in the computational mass spectrometry community are promising and offer a new exploratory method of mining imaging mass spectrometry data.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3dd00165b
This journal is © The Royal Society of Chemistry 2024