Rupali
Mankar
a,
Michael J.
Walsh
b,
Rohit
Bhargava
c,
Saurabh
Prasad
a and
David
Mayerich
*a
aDepartment of Electrical and Computer Engineering, University of Houston, Houston, TX, USA. E-mail: mayerich@uh.edu
bDepartment of Pathology, University of Illinois – Chicago, Chicago, IL, USA
cBeckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Champaign, IL, USA
First published on 16th January 2018
Tissue histology utilizing chemical and immunohistochemical labels plays an important role in biomedicine and disease diagnosis. Recent research suggests that mid-infrared (IR) spectroscopic imaging may augment histology by providing quantitative molecular information. One of the major barriers to this approach is long acquisition time using Fourier-transform infrared (FTIR) spectroscopy. Recent advances in discrete frequency sources, particularly quantum cascade lasers (QCLs), may mitigate this problem by allowing selective sampling of the absorption spectrum. However, DFIR imaging only provides a significant advantage when the number of spectral samples is minimized, requiring a priori knowledge of important spectral features. In this paper, we demonstrate the use of a GPU-based genetic algorithm (GA) using linear discriminant analysis (LDA) for DFIR feature selection. Our proposed method relies on pre-acquired broadband FTIR images for feature selection. Based on user-selected criteria for classification accuracy, our algorithm provides a minimal set of features that can be used with DFIR in a time-frame more practical for clinical diagnosis.
Recent attempts to address these issues include the use of discrete frequency infrared (DFIR) imaging. DFIR allows the collection of individual wavelengths using either a tunable quantum cascade laser (QCL) source16,17 or filter bank.18 However, sparse collection of spectral features is only possible if the required bands are known a priori. In this study, we address the problem of developing an optimized search for finding bands that are important for multi-class histological labeling of biological tissue. TMA cores are imaged using FTIR to generate data for our search. Due to the nature of DFIR images, feature selection methods, such as principal component analysis (PCA), and independent component analysis (ICA),19 and vertex component analysis (VCA)20 are impractical, since they result in features that have broad spectral support. Our focus is on feature selection, since the resulting features can be translated to DFIR image acquisition, significantly reducing the time required for image acquisition and eliminating the need for later dimension reduction. We compare the results of the proposed GPU-based genetic algorithm using linear discriminant analysis (GA-LDA) with other prominent feature selection methods, demonstrating that this technique provides comparable results to feature extraction methods. Finally, we show that these features provide excellent accuracy when applied to DFIR images acquired using a commercial QCL-based imaging system.
FTIR spectroscopy usually relies on illuminating a sample using a broad-band (globar) source that is passed through an interferometer to collect an interferogram. Individual wavelengths are separated using a Fourier transform. Data rate is usually limited by the low intensity of benchtop broadband IR sources, combined with the limited size of compatible mercury cadmium telluride (MCT) focal plane array (FPA) detectors. Data collection is often followed up with noise and dimension reduction algorithms to further improve SNR and mitigate the problems associated with a large number of features. This suggests that current histological classification relies on a sparse spectrum, containing a limited number of important components necessary for classification.25,26 Consequently, DFIR imaging may provide an alternative to FTIR imaging by allowing direct imaging of the sparse features necessary for classification and thereby mitigating problems with both acquisition and data size.27 DFIR instruments coupled with narrow bandwidth quantum cascade lasers can image a tissue sample at spectral bands optimal for classification.28 QCL-based instruments provide a coherent source, allowing the use of high-resolution bolometer detectors and better optics.17 With a priori knowledge of important features, it is possible to limit collection to the most informative bands for classification.
Raw hyperspectral imagery has thousands of spectral features, making them difficult to classify due to memory constraints and prone to over-fitting. Dimension reduction is generally used to provide a more concise spectral representation. The most common approaches project the spectra into a new space and sort them based on some standardized score. The highest scored features are used in classification. Two broad types of dimension reduction are used: projection-based feature extraction and feature selection.29 In projection-based feature extraction, basis functions are computed from high dimensional data using either an unsupervised or supervised approach. Principal component analysis (PCA) and Linear discriminant analysis (LDA), with some variations, are widely used feature extraction methods.30,31 In the case of PCA, features are sorted based on the amount of variance accounted for in each projection,32,33 whereas LDA relies on supervised data to optimize linear separability.
Feature selection methods rely on known sparsity within the spectrum, and common methods include the least absolute shrinkage and selection operator (LASSO), sequential feature selection method minimum Redundancy Maximum Relevance (mRMR)34 and evolutionary learning genetic algorithms. As discussed earlier, our goal is to select features so that we can take advantage of DFIR imaging for fast tissue characterization.
Hyperspectral images collected using FTIR also undergo preprocessing, such as baseline correction and normalization, in order to mitigate spectral features correlated with sample structure and scattering.23 The proposed feature selection is performed on preprocessed FTIR images of tissue microarrays from multiple patient biopsies. A GA is designed to perform supervised feature selection, relying on mRMR as an initialization step and LDA as an optimization metric. This method is ideal for optimizing features across multiple classes and classifiers. We then validate the selected features, demonstrating that the GA-LDA approach provides results are of comparable quality to feature extraction while also being compatible with DFIR imaging. Finally, we validate this approach for QCL-based images of tissue biopsies collected using a commercial DFIR imaging system (Fig. 1).
The FTIR images were annotated by an experienced pathologist based on adjacent histological staining. Histological stains included hematoxylin & eosin (H&E) and Masson's trichrome for all samples. This allowed us to examine general tissue structure and annotate collagen and smooth muscle. Since the breast images were the most histologically complex, additional immunohistochemical labels were used to confirm tissue phenotype. These include cytokeratin 19 for differentiating between epithelium and necrosis, as well as alpha smooth muscle actin (αSMA) and vimentin for identifying myofibroblasts.
Labels focused on cancer-relevant phenotypes ranging from 4–7 classes types, depending on tissue type. Feature selection was performed using the proposed GPU-based GA-LDA optimization method (section 1). Our goal was to select features from FTIR data that provide high-quality classification with finite support optimal for DFIR imaging. Since current commercial QCLs are limited to the 920–1800 cm−1 range, analysis was limited to this window for feature selection. After the desired number of features were selected, the annotated data was then used to train a Random Forest36 classifier. For random forest classifier 100 number of trees are selected and at each decision split of a tree default number of features are selected which is square root of total number of features used for classification. Ensemble treebagger is executed with parallel mode option using 6 threads on CPU. Classifier performance was validated on independent TMA images. When training and validating classifiers, it is crucial to ensure that individual cores are not shared by the training and validation sets. In such a case, any deviation in focus will cause chemical information to smear across pixels resulting in inflated performance measurements.
Finally, we validated the performance of our features in practice using DFIR images. TMA's were re-imaged using a QCL-based DFIR imaging system (SPERO, Daylight Solutions) with a 12.5× 0.7NA refractive objective. Data was collected at discrete frequencies specified by the proposed optimization algorithm. Note that all features require corresponding baseline points and a normalization band. Since the FTIR data were normalized to Amide I (1650 cm−1), each spectral feature would require at most two adjacent baseline bands. However, note that optimized features often share similar peaks and can share baseline points (Fig. 2). Rubber band baseline correction was used,23 combined with normalization to Amide I peak. No other noise reduction or pre-processing was applied. No other noise reduction or pre-processing was applied.
The mRMR algorithm extracts these features iteratively by optimizing the following condition for each j ∈ [1,…, N]:
(1) |
(2) |
The mRMR algorithm performs an iterative search for optimal features, which are added to the final set n until the desired N features are identified (Algorithm 1). A feature index j corresponding to feature xj in F is added to the final feature set if it exhibits: (i) maximum mutual information with the distribution of class indices in the training set, and (ii) minimum mutual information with previously selected features in n.
Output: n ∈ N
1: n = ∅
2: while |n| ≤ Ndo //while features <N
3: j = 1
4: whilej ≤ Bdo
5: find j such that eqn (1) is maximized
6: append j to n
7: end while
8: end while
While mRMR algorithm runs in O(n2) (quadratic) time and is deterministic, this efficiency comes in the form of a greedy algorithm that limits sampling to an extremely small subspace. Optimization is constrained by previously selected features, which are likely sub-optimal. Adding additional features increases redundancy rather than improving classification accuracy – particularly for complex data.
Given a number of desired features N ∈ , a single GA attempts to select an optimal set of wavelengths that maximize classification accuracy (Fig. 1). Our GA is first initialized with a set of probable solutions, referred to as the initial population, one of the probable solution is initialized using mRMR feature selection algorithm (section 1).
Input to the genetic algorithm is a two dimension feature matrix loaded from the original three dimension hyperspectral image. In this feature matrix, rows are pixels (samples) and columns are spectral features (bands). Over the course of G generations, our GA attempts to select an optimal feature set given a specified number of target features N. Each genome is a vector of size N, and each element of a genome is a spectral feature index from the input feature matrix F. The population matrix P is a matrix P ∈ P×N, where N number of features to be selected using GA-LDA algorithm and P is the population size, which determines how densely the feature subspace is sampled. Increasing P and G provide additional optimization at the cost of processing time, where P is user-specified and G is determined by a stopping condition described later.
At each generation g ∈ [0, G), all genomes from the current population are evaluated using an optimization function and ranked into R according to their fitness values Vg. Based on these rankings, three population evolution operations are performed on current population reproduction, crossover, and mutation to generate next generation population. This evolutionary search for optimal spectral features continues until a stopping criteria is met. In our algorithm stopping criteria is until execution it reaches to maximum number of generations Gmax or improvement in fitness value of best highest ranked genome is stalled over Gs number of generations, here Gmax and Gs are user defined numbers and Gs ≪ Gmax.
Output: n ∈ N
1: G = 0
2: initialize population matrix P0 ∈ P×N
3: whileG < Gmaxdo
4: for each genome p ∈ [0, P) do
5: evaluate genome fitness [p]
6: end for
7: generate a sorted list of genomes α ∈ [0, P) based on
8: ex. α[0] = argmaxj[j] and α[P − 1] = argminj[j]
9: if genome α[0] meets stopping criteria then
10: return genome α[0]
11: end if
12: Generate new population PG+1:
13: reproduce genomes α[0] to α[nr − 1]
14: generate new genomes for α[nr] to α[N − 1] by crossover
15: randomly mutate nm genomes in α[nr] to α[N − 1]
16: G = G + 1
17: end while
18: return genome α[0]
Evolutionary learning is based on natural selection, such that the highest ranked nr < P genomes are preserved (reproduction) in next generation population G + 1. The remaining nc = P − nr genomes are updated by performing the crossover operation on current generation genomes. Crossover is performed on parent genomes using a tournament selection method. A subset of ts parent genomes are selected from the current generation population PG from the subset of nc genomes that are not reproduced. Each pair of selected parents go through crossover to generate a new genome. Once nc new genomes are generated in Pnext then with random sampling nm genomes from nc crossovered genomes go through mutation.
For example, assume a population of P = 20 genomes with nr = 2 and nm = 10. The next generation will consist of the nr = 2 optimal genomes from the previous generation, while the remaining nc = 18 genomes will consist of permutations of the previous sub-optimal genomes. This crossover is performed by selecting parents from the sub-optimal nc = 18 genomes from the previous generation. Out of these nc = 18 crossover genomes, nm = 10 undergo random mutations. This random selection is done by using mutation probability, genomes for which randomly generated probability is greater than user defined mutation probability will go through mutation. In mutation operation, one or more genome entries are altered based on randomly generated probability to any feature index from the entire spectral range of input data. Mutation is important for GA as it avoids getting trapped in a local minimum.
Once a new population is obtained, the next generation is evaluated (Algorithm 2) until one of two conditions are met: (i) a maximum number of generations Gmax is exceeded or (ii) there is less than ε improvement in the fitness score of the best genome.
The fitness function is a key component of the GA. Since statistical classifiers are frequently used in chemometrics, we focus on the linear separability of target classes using linear discriminant analysis (LDA). This is a supervised method for finding a linear transformation which will maximize inter class separability and minimize intra class variability. We implement Fisher's criterion as a generalized eigenvalue decomposition problem.41 The proposed GA finds a feature subset using LDA such that the projected data will have maximum between class scatter Sb and minimum within class scatter Sw. At each generation, the optimal transformation matrix for the current feature subset is computed by maximizing Fisher's ratio:
(3) |
(4) |
(5) |
We first load the hyperspectral image as a feature matrix F ∈ S×B (Fig. 5a), where S is the number of samples (pixels) and B is the number of available bands (features). A population matrix P ∈ P×N is constructed with P genomes containing N features. The number of features N is tested for a user-specified range in order to determine the trade-off between accuracy and feature number. The population size P determines the number of permutations tested for each generation.
While a CPU-based algorithm requires each genome to be evaluated sequentially, our GPU-based approach compiles the entire population into a phenotype tensor H ∈ S×N×P, where each value in H is a feature value corresponding to the appropriate sample and genome index from the current population matrix P. This phenome matrix H is then used to calculate the class mean tensor Φ ∈ C×N×P and the global mean matrix M ∈ N×P (Fig. 5a). These tensors are stored in GPU memory and used to calculate the between class scatter Ψβ (Fig. 5b) and within class scatter Ψω (Fig. 5b) based on the following discrete formulations of eqn (4) and (5) given in section 2:
(6) |
(7) |
For each genome, the scatter matrices are in an N×N Hilbert space. Across the entire population, the corresponding tensors Ψβ and Ψω are N × N × P and can be calculated in parallel using eqn (6) and (7). Each element of Ψβ and Ψω are assigned a thread and computed independently. The results are stored in the corresponding tensors Ψβ and Ψω. This also enables selection of a high P, which provides faster convergence. Ψβ and Ψω consist of N × N scatter matrices Sb and Sw corresponding to each genome in the current population. Once Sb and Sw are calculated for each genome, the LDA projection basis T is computed as a generalized eigen decomposition problem using the LAPAKE library with CPU threads. Fisher's ratio is computed from the projected data and provides a score for each genome. All genomes in current population are ranked according to their fitness scores and used to create the next generation population (section 2).
Training was performed in two steps: (1) feature selection/extraction using PCA, mRMR, and GA-LDA and (2) training of a random forest classifier. In order to test performance for varying parameters, different classifiers were created by varying numbers of features. There was no overlap between training and validation arrays. For breast histology, training was performed on a 101-core breast array (BR-1003, US Biomax) containing samples from 40 tumor biopsies of various types (hyperplasia, atypical hyperplasia, ductal carcinoma, and lobular carcinoma) and 7 normal biopsies. For kidney histology, training was performed on a 100-core kidney cortex array (AMS701, AMBbio).
The 5-class kidney is validated on independent cores from TMA AMS701 and validation results for 5-class kidney classifier shows that GA-LDA can achieve a high level of accuracy for even small numbers of features (Fig. 6a). Also, 7-class breast histology model was validated using two independent arrays (BRC961 and BR1001, US Biomax) containing a total of 196 cores from 100 different patients which were imaged using FTIR microscopy. This model shows the clear effectiveness of GA-LDA for selecting features that are compatible with a larger number of classes (Fig. 6). Validation on FTIR images of two types of tissues 5-class kidney and 7-class breast shows that GA-LDA can achieve a high level of accuracy for even small numbers of features for more number of classes as well (Fig. 6a and b). This model is generally superior to PCA, likely due to the use of a supervised training set. This implies that linear separability measured using Fisher's ratio provides a better overall optimization than feature variance. While there is significant use of the PCA approach for classification assuming it is more informative, this example illustrates nicely that complex signatures may be more robustly classified by use of discrete features. While the use of discrete features for rapid classification is more than a decade old,26 the impetus for using discrete features was limited to speeding up post-acquisition processing. With faster computers and greater availability of storage, use of discrete features was not critical. Combined with the availability of DFIR imaging, however, the opportunity now exists to obtain more accurate classifications faster. This is particularly true for unusually complex chemical signatures, such as the differentiation of fibroblasts and myofibroblasts, as well as structures such as lymphocytes that are below the diffraction limit for IR. Finally, note that GA-LDA provides another advantage over PCA: the ability to translate features for use in DFIR imaging.
We test translation of GA-LDA features to DFIR using a 4-class model by validating on a 100-core (50 patient) array (AMS802, AMSbio). To demonstrate the methodology proposed here, instead of a comprehensive histologic analysis, we focused on testing of a 4-class model containing the most cancer-relevant cell types (epithelium, collagen, blood, and necrosis). The entire available spectrum (900–1800 cm−1) was imaged to create a database of bands that would be compatible with any features selected or extracted using mRMR, PCA, and GA-LDA. Performance in both the FTIR and DFIR images are shown (Fig. 6c and d). Classification of both FTIR and DFIR imaged tissue cores using only 20 features selected from FTIR imaged data is shown in (Fig. 3).
We demonstrate convergence of classifier performance for epithelium, which is the most important class for initial cancer diagnosis. The ROC curve (Fig. 7a) is shown for several feature selection values, and can be readily computed for large numbers of features on a desktop system. This allows a researcher or clinician to select the desired number of features to optimize for specificity, sensitivity, and image acquisition time.
While the proposed GA-LDA approach performs significantly better for larger numbers of classes, it also provides additional benefits when considering clinical translation of infrared spectroscopic imaging. Unlike feature extraction algorithms, such as PCA, GA-LDA features are compatible with DFIR instruments. GA-LDA algorithm also overcomes problem of getting trapped in local spectral region like mRMR as described in Fig. 4. In this figure, dotted line of mRMR sampling is a limited features space for second feature selection with mRMR after selection of first feature which is point on horizontal axis. This means mRMR algorithm cannot explore feature space beyond constrained region, whereas GA-LDA explores other regions in the feature space by introducing mutation in genomes. We also address execution time, which is the primary bottlneck for GA optimization, by providing a GPU-based implementation that can be run on inexpensive workstations available to both clinical and research laboratories.
We note that the proposed framework can be readily extended to manifold learning inspired fitness functions that may be more suitable when the class-conditional data are multi-modal or non-Gaussian. For instance, Local Fisher's ratio, which incorporates locality constraints in the Fisher's ratio through an affinity matrix41 ensure that multi-modal distributions remain multi-modal in the lower dimensional subspace and in the resulting set of selected features. In future work, we will consider this to overcome multi-modality in data set introduced by variations in tissue preparation and imaging environment.
In addition, feature selection makes DFIR imaging more practical if the desired classes are known a priori. This is particularly useful in a clinical environment, where there is a benefit to high-throughput screening of known diseases.
Finally, the proposed GPU-based implementation makes genetic algorithms far more practical to analytical scientists, since inexpensive hardware can be used to leverage a large computational benefit. The highly data-parallel nature of this problem allows full characterization of classifier performance across a large range of features, allowing biomedical researchers to carefully optimize for classification accuracy and imaging time. Together, the methods and results of this study propel analytical methods based on DFIR imaging forward to facilitate routine use of IR imaging for histopathology.
This journal is © The Royal Society of Chemistry 2018 |