Self-supervised clustering of mass spectrometry imaging data using contrastive learning

Mass spectrometry imaging (MSI) is widely used for the label-free molecular mapping of biological samples. The identification of co-localized molecules in MSI data is crucial to the understanding of biochemical pathways. One of key challenges in molecular colocalization is that complex MSI data are too large for manual annotation but too small for training deep neural networks. Herein, we introduce a self-supervised clustering approach based on contrastive learning, which shows an excellent performance in clustering of MSI data. We train a deep convolutional neural network (CNN) using MSI data from a single experiment without manual annotations to effectively learn high-level spatial features from ion images and classify them based on molecular colocalizations. We demonstrate that contrastive learning generates ion image representations that form well-resolved clusters. Subsequent self-labeling is used to fine-tune both the CNN encoder and linear classifier based on confidently classified ion images. This new approach enables autonomous and high-throughput identification of co-localized species in MSI data, which will dramatically expand the application of spatial lipidomics, metabolomics, and proteomics in biological research.


Methods and data analysis Mass spectrometry imaging data
Mouse uterine and brain tissue MSI datasets used as examples in this study have been previously reported. 1,2 Uterine tissues from 2-month-old female mice were snap-frozen, and frozen sections were made with a cryostat. The mice were maintained on a C57BL6 mixed background and housed in the vivarium at the Cincinnati Children's Hospital Medical Center according to NIH and institutional guidelines for laboratory animals. This protocol was approved by the Cincinnati Children's Hospital Research Foundation Institutional Animal Care and Use Committee. Mouse uterine tissue was analyzed using nano-DESI MSI on a Q-Exactive HF-X Orbitrap mass spectrometer (Thermo Fisher Scientific, Waltham, MA) equipped with a custom-designed nano-DESI source. 3 Mass spectra were acquired in the m/z range of 133-2000 in both positive and negative ion modes with a spatial resolution of 10 μm. Positive mode MSI data for mouse brain tissue was obtained from METASPACE, 4 a community resource that provides open access to MSI data. The specific data were acquired using MALDI MSI in the m/z range of 600-1000 with a spatial resolution of 20 μm. The dimensions of the two MSI datasets are listed in Table S1.

Data pre-processing
To generate ion images from MSI data, we used peak detection and m/z binning as described in our previous study. 5 The signal intensity for each m/z in each pixel was extracted from the corresponding mass spectrum with a bin width of ± 10 ppm and normalized to the total ion current. To remove visual spikes, pixels with intensities > 0.999 quantile were reassigned with the 0.999 intensity quantile value. Ion images of mouse uterine tissue were resized to 96 x 96 pixels. Meanwhile, larger-size ion images of mouse brain tissue were resized to 224 x 224 pixels. Most pre-trained CNN models and Pytorch transform functions accept RGB images. Thus, raw pixel intensities of ion images were normalized between 0 and 255 and copied to 3 channels. More specifically, we converted ion images into the PIL format before the data augmentation step. To benchmark our approach, we manually selected 367 mouse uterine ion images with distinct ion distributions and clustered them into 13 groups according to molecular colocalizations as shown in Fig. S6. For the unannotated mouse brain dataset, we detected 1101 peaks from the average spectrum and generated the corresponding ion images as shown in Fig. S8.

Architecture of the self-supervised clustering
We approached the challenge of molecular localization clustering as an image classification task. We aimed to re-train a CNN model for an individual MSI dataset to classify ion images based on the high-level spatial features without manual annotations. The model architecture is shown in Fig. 1. The pre-trained CNN (EfficientNet-B0) is re-trained by contrastive learning and self-labeling sequentially in a self-supervised manner. We use EfficientNet-B0, which has been trained on the ImageNet database. We used the EfficientNet-B0 model before the classification layer as an encoder. In our architecture, we firstly learned ion image representations through the contrastive learning. More specifically, SimCLR 6 approach is adopted in this study. After this first phase of training, we fed ion images through the re-trained encoder to produce a set of feature vectors, which were then passed to a spectral clustering (SC) classifier to generate the initial labels for the classification task. To complete a learnable classification CNN, a linear classifier (a linear layer followed by a softmax function) was attached to the encoder and trained with the original ion images and initial labels as inputs. Finally, we utilized a self-labeling 7 approach to fine-tune both the encoder and classifier, which allows the network to correct itself. Fig. S1. The representation of an ion image is the output before the classification layer of EfficientNet-B0. A small multilayer perceptron with one hidden layer maps the representations to the projection space ( Z ) where the contrastive loss is applied. In the training step each ion image is used to generate a pair of augmentations. We treat two augmentations of one ion image as a positive pair ( , ) ij. Then the loss function is defined as where 2N is the number of augmented images, , To evaluate the quality of learned representations for mouse uterine benchmark, we used a linear evaluation and the nearest neighbor mining. In the linear evaluation, one average ion image was generated from each manually classified group of images as the centroid of the cluster. A linear classifier was subsequently trained on top of the frozen retrained encoder, with 13 average ion images and their corresponding annotated labels. Next, all original ion images were classified by the encoder and the updated linear classifier. The resulting classification accuracy is used as a proxy for the quality of image representation. 6 In the nearest neighbor mining protocol, for each ion image, we searched its K nearest neighbors (K  [1, 30]) based on cosine distance in the representation space. We quantified the purity of the neighborhood by counting the annotation-matching pairs for each image and their nearest neighbors. We also visualized the learned representations using t-SNE with scikitlearn default settings.

Image clustering
It has been demonstrated in a previous report 7 and this study (Fig. 2e) that images with similar high-level spatial features are mapped together in the representation space using contrastive learning. To leverage the meaningful local neighborhoods, SC was adopted to cluster ion images as the classification pretext task. Based on the cosine distance, 10 nearest neighbors of each ion image were identified to construct a graph. Next, we used the discretization approach to cluster nodes in the graph after the Laplacian embeddings. 8 After the SC, we used the resulting classification labels to initialize a learnable linear classifier on top of the CNN encoder, which enables the following self-labeling process to fine-tune the model (step 2 in Fig.  1). More specifically, the linear classifier is composed of a linear layer and a softmax function. We set the encoder in frozen mode and trained the classifier with original ion images and initial labels obtained from SC.

Self-labeling
As reported in a previous study, self-labeling improves the CNN model using a plain criterion: two independently augmented images from one ion should be classified into the same cluster. With this principle, the self-supervised training is able to enhance the generalization power of the model. In addition, only the confidently-classified ion images are included in this training. 7 As shown in Fig. 2f and Fig. S5, we observed that the classification accuracy obtained for selected mouse uterine ion images increases with an increase in the softmax probability threshold. This indicates that the softmax probability threshold may be used to exclude falsely classified ion images from the training, which enhances the accuracy of the CNN model in the self-labeling step. In the implementation, we empirically selected the probability threshold with respect to the sample population. More specifically, after the initialization of the linear classifier, we selected the 40% quantile of softmax probabilities as the threshold. The selected training data containing 60% of the original ion images has a larger fraction of correctly classified images than the original data. Training samples were updated with the same probability threshold at every epoch, thus we gradually included more samples into the training (Fig. 2g). For each selected ion image, we applied a weak and a strong augmentations, respectively. Two pseudo labels were obtained after the encoder and classifier. A weighted cross-entropy loss was then applied to the minibatch of weakly augmented ion images to update the parameters of the CNN model.

Clustering accuracy evaluation
In this study, we used over-clustering to effectively capture the intra-class variance. The clustering accuracy was calculated using the following equation: Where 12 , , , I c c c is the set of predicted clusters, 12 , , , J t t t is the set of ground truth classes, N is the total number of ion images. In each cluster, the most frequent ground truth class was identified and assigned as a predicted class for the whole cluster. The accuracy was calculated by counting the correctly predicted ion images and dividing by N .

Comparison of the self-supervised clustering with vector-based methods
First, we evaluated the pairwise similarity measurements with different input data. In the current method, we used CNN feature vectors for the downstream classification task. For a comparative study, image vectors were generated by flattening the original images after 0.999 quantile hot spot removal as a pre-processing step. Three metrics, including Euclidean distance, cosine similarity, and Pearson correlation were calculated for comparison.
Next, we evaluated the clustering accuracy of two established machine learning methods that rely on image vectors: a density based clustering analysis with nonlinear dimensionality reduction 9 and Ward hierarchical clustering 5 .The Uniform Manifold Approximation and Projection (UMAP) was performed using the umaplearn package with the following parameters: n_components = 3, n_neighbors=5, min_dist=0.5, metric = 'cosine'. We note that these parameters were optimal for visualizing and grouping of image vectors, but were not best suited for the CNN feature vectors (Fig. 2d as reference). Nevertheless, we used the same UMAP parameters in Fig. 4c and 4d for the visualization of both results for an objective comparison between them. Hierarchical Density-based Spatial Clustering for Applications with Noise (HDBSCAN) was performed to cluster UMAP representations using default parameters except for "min_cluster_size". We tuned "min_cluster_size" in order to obtain either 13 or 20 clusters. In all the HDBSCAN clustering results, the number of noise data points was less than ten. In the Ward hierarchical clustering, each vector was normalized to the range of [0, 1]. The method was implemented using scipy package (scipy.cluster.hierarchy.linkage) with the following parameters: method='ward', metric='euclidean'."

Isotopic recall evaluation
Isotopic recall is a metric reported in a previous study 10 to evaluate the performance of ion image clustering. Isotopic ion images were identified based on the m/z shift and Pearson correlation of the ion image with that of the candidate monoisotopic peak. Specifically, mass spectral features separated from each other by S5 m/z 1.003 with a tolerance of m/z 0.01 were assigned as candidate isotopic peaks. Second, ion images of isotopic peaks should have a Pearson correlation greater than 0.5. For the self-supervised clustering results, we counted the number of isotopic images that were correctly clustered together and divided it by the total number of identified isotopes. This fraction gives the isotopic recall, which is in the range of 0 to 1.

Data augmentation
Data augmentation is crucial for contrastive learning and self-labeling. We systematically studied the impact of data augmentation operators as listed in Table S3. For each single augmentation operation, we randomly sampled a parameter interval for an image transformation function. Moreover, we classified the compositions of data augmentation operators into three classes. Medium augmentation was used in SimCLR, while weak and strong augmentations were used in self-labeling.

Model training protocal
For SimCLR training step, we used Adam optimizer with the initial learning rate of 0.001. A cosine annealing with a period of training epochs was used to decay the learning rate. According to the previous report, SimCLR benefits from larger batch sizes. 6 In our implementation, we used the largest batch size allowed by the GPU memory and trained for 100 iterations. For the self-labeling step, an Adam optimizer with the initial learning rate of 0.0001 and the same cosine annealing schedular were used. We trained the CNN for 300 iterations. We implemented the model training on Google Colaboratory, a cloud computing platform. Using the NVIDIA Tesla P100-16GB GPU, the total training time for mouse uterine MSI benchmark and mouse brain MSI dataset was about 15 and 45 minutes, respectively. Table S1. Information of benchmark mouse uterine dataset and unannotated mouse brain dataset. Table S2. Accuracy of spectral clustering results with 20 clusters at conditions of varying neighborhood sizes. Ion image representations were generated by encoder shown in Fig. 2d. Table S3. A summary of image augmentation operator parameters. The kernel size in Gaussian blur is set to be 10% of the image height/width. Off-tissue mask was not applied to unannotated dataset.  Figure S1. The framework of SimCLR.        A zoom-in image shows that three ion images (shown in Fig. 4a) are clustered together. Although Ward hierarchical clustering is able to capture major patterns in input data, it cannot differentiate between the ion image of m/z 739.4681 and two other ion images used in this example.