Jonathan C. Burley*a, Adeyinka Aina a, Pavel Matousek b and Christopher Brignell c
aSchool of Pharmacy, University of Nottingham, Boots Science Building, NG7 2RD, UK. E-mail: jonathan.burley@nottingham.ac.uk; Fax: +44 (0) 115 951 5102; Tel: +44 (0) 115 84 68357
bCentral Laser Facility, Research Complex at Harwell, STFC Rutherford Appleton Laboratory, Oxfordshire OX11 0QX, UK
cSchool of Mathematical Sciences, University of Nottingham, NG7 2RD, UK
First published on 25th October 2013
We report the first systematic characterisation of data sub-selection with multivariate analysis to be applied to either transmission Raman spectroscopy (TRS) or the low-wavenumber Raman region. A model pharmaceutical formulation comprising two polymorphs mixed in the range of 1–99% is investigated. For data sub-selection, sparse partial least squares is for the first time applied to TRS data and compared with principal component analysis. It is found that low-wavenumber data (50–340 cm−1) are superior for quantitative modelling to data in the more conventional mid-wavenumber range (340–2000 cm−1). Our results point the way to enhanced quantitative analytical capabilities for TRS, with potential application areas including pharmaceuticals, security and process-analytical technology, by combining data sub-selection with low-wavenumber-capable optics.
In other settings there may be a requirement to select only a subset of the data after the experiment has been performed (data reduction). This might be, for example, the selection of one chemically meaningful peak (e.g. carbonyl) for analysis, or it may be to produce as reliable and/or as simple a quantitative model as possible, or to reduce the mathematical and computational complexity of data analysis (e.g. Raman, NIR or mid-IR mapping with large hyper-spectral datasets).
Regardless of whether the decision on which spectral range to use is taken before or after data collection, it is important that the choice is informed and leads to an optimum output of information and creation of new knowledge.
A large body of work exists on the application of post-experiment variable selection, particularly in the field of NIR spectroscopy.1–6 A recent review article covers much of this work.7 The primary drivers for the use of data sub-selection are (i) chemical (e.g. selecting only wavelengths which correspond to the analyte of interest), (ii) physical (e.g. selecting a sub-region for which temperature dependence or humidity does not strongly affect the analysis), (iii) statistical (e.g. reducing the input of wavelengths with more noise than signal) and (iv) other requirements (e.g. simplifying models for translation between instruments, speed of analysis, computational requirements etc.).
In comparison with NIR spectroscopy, which tends to yield fairly broad peaks which are typically not directly linked to a particular chemical group (overtones and combination bands make up the majority of the spectrum8–10), data sub-selection in mid-IR, THz-IR and Raman spectroscopy has received far less attention.1–7,11–15 This emphasis on NIR likely arises due to the frequent requirement in NIR spectroscopy for chemometric methods to understand data. These methods are less common in mid-IR, THz-IR and Raman spectroscopies as the spectra produced are more amenable to direct interpretation.
Two styles of data sub-setting after data collection are reported in the literature. The more common style is the selection of a number of input variables (wavenumber and intensity pairs) which are non-contiguous (e.g. ref. 7 and references therein). The main aim associated with this sub-setting method is typically the reduction of input noise to the model and an associated increase in the accuracy, precision and reliability of the model. The potential disadvantage of non-contiguous sub-setting is that it is likely to be specific to a particular problem (e.g. sample and/or spectroscopic technique). Less commonly, a continuous ‘spectral window’ can be selected (e.g. ref. 5). This method places restrictions on the choice of wavenumber–intensity pairs; it is therefore likely to be less effective at noise removal, but may be more transferable.
The motivation for the current paper is to examine data sub-selection in the context of pharmaceutical quality control, with an overall aim of determining whether sub-selection allows an improvement in quantitative ability. This is an area of major industrial and societal importance. We specifically focus our attention on two emerging aspects of Raman spectroscopy. Firstly, we examine the utility of low-wavenumber data. Traditionally, many Raman spectrometers have only been able to access the wavenumber region above 300 cm−1 unless more specialist equipment was used, access to the lower wavenumber end of the spectrum being limited by the filters required to reject the very intense laser light (at 0 cm−1). In recent years, improvements to these filters have allowed easy access to data well below 100 cm−1 in many cases. This low-wavenumber spectral region contains a great deal of information on inter-molecular (rather than intra-molecular) vibrational bands, and can allow for a rapid assessment of crystalline vs. amorphous, salt vs. free base, co-crystal vs. physical mixture, polymorph identification etc. (e.g. ref. 16 and references therein). Secondly, we examine the application of data sub-selection to transmission Raman spectroscopy (TRS). Despite being initially reported17 in 1967, TRS has only become available routinely in the last few years (for a recent review see Buckley and Matousek18). In terms of applications, TRS can penetrate deeply (ca. 50 mm or more) into opaque samples, compared to backscattering Raman spectroscopy which is strongly biased to the near-surface areas (<1 mm or so). To the best of our knowledge this is the first report of data sub-selection applied to either TRS or low-wavenumber Raman data in a systematic way.
A limited amount of work has been reported by other groups centred around the problem of feature extraction and data selection in conventional back-scattering Raman spectroscopy. Madden and Ryder have demonstrated the use of machine learning techniques, with data sub-selection undertaken using neural networks and k-nearest neighbour approaches, for the quantification of cocaine hydrochloride in powder mixtures which also contained anhydrous D-glucose and caffeine.15 McShane and co-workers have reported a ‘peak-hopping stepwise feature selection method’ for sub-selection of spectral regions.14 Their method allowed a reduction in the average percentage error from 65.2%, when all data were used, to 9.4%, when a subset of data was employed, along with a clear improvement in the quantification (see Fig. 5 in their paper). The most relevant report with respect to the current contribution is from Strachan et al.,19 who considered a mixture of two polymorphs of carbamazepine in the range of 1–10% loading of form I in form III, to which principal component analysis and visual sub-selection of data were applied. They concluded that ‘… selection of spectral regions that contain visually detectable differences between the two forms appears to be the best data selection technique’. This is the hypothesis which we investigate in this work, by comparing sparse partial least squares data sub-selection directly with principal component analysis. The former method identifies the data points which provide the best quantitative model, and the latter identifies the data points which exhibit the most change as a function of composition. Whereas Strachan et al. employed conventional back-scattering Raman spectroscopy, we address this issue in the emerging areas of transmission Raman spectroscopy and low-wavenumber Raman scattering, for a full compositional range.
For the present study the samples comprise mixtures of two polymorphic forms of flufenamic acid (FFA) as powders. FFA is a non-steroidal anti-inflammatory drug. In the solid state it can adopt a number of polymorphs,20–23 of which forms I and III are stable and readily prepared through a simple solvent evaporation protocol. The crystal structures of forms I and III are known,22,23 and indicate that these are conformational polymorphs (different molecular conformations are adopted in the two polymorphs). A previous paper by some of the present authors examined the use of the full spectral range in quantification;24 the analyses reported here are of the same data, but with a focus on sub-selection of data.
The mathematical problem posed is to find a linear relationship between the composition and the spectrum for each composition measured at p wavenumbers. A linear model such as y = Xb, where y is a vector of the n composition values and X is the n × p design matrix of spectral values, cannot be solved uniquely for b using ordinary least squares (OLS) when p > n. In spectroscopy, p is usually several orders of magnitude bigger than n. Partial least squares25 is one dimension reduction technique for solving this ill-posed problem. However, neither OLS nor PLS identifies which predictor variables, in this case wavenumbers, are important, since a large proportion of the variables contribute to the solution.
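The ill-posed nature of the problem when p > n can be illustrated with a toy example. The sketch below uses made-up numbers (n = 2 compositions, p = 3 wavenumbers) that are not taken from the present dataset; the names `predict`, `b_first` and `b_second` are purely illustrative. It shows that two different coefficient vectors reproduce y exactly, so least squares alone cannot single out one solution.

```python
# Toy illustration with made-up numbers: n = 2 samples, p = 3 wavenumbers.
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
y = [1.0, 2.0]


def predict(X, b):
    """Return the vector Xb for design matrix X and coefficient vector b."""
    return [sum(x_ij * b_j for x_ij, b_j in zip(row, b)) for row in X]


# Two different coefficient vectors that both reproduce y exactly.
b_first = [1.0, 2.0, 0.0]
b_second = [0.0, 1.0, 1.0]

fits_first = predict(X, b_first) == y
fits_second = predict(X, b_second) == y
```

Both `fits_first` and `fits_second` evaluate to `True`, which is why some form of dimension reduction or regularisation is needed before b can be estimated.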
A sparse or parsimonious solution would constrain a subset of the values in b to be exactly zero, thus incorporating variable selection directly into the parameter estimation. Sparse solutions for ordinary least squares regression can be found using the ‘lasso’,26,27 which applies an L1 penalty to shrink some entries in b to zero. A similar method for sparse partial least squares regression has recently been developed,28 thus achieving dimension reduction and variable selection simultaneously. We use this new technique to estimate the linear combination of a subset of wavenumbers which predicts the composition with the smallest error.
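The way in which an L1-type penalty produces exact zeros can be sketched with soft-thresholding applied to a single PLS weight vector. This is a simplified, single-component illustration under our own assumptions, not the published sparse-PLS algorithm cited above, which may differ in detail. The function names, the tuning parameter `eta` and the toy data are all hypothetical.

```python
def soft_threshold(values, threshold):
    """Shrink each value towards zero by `threshold`; small values become exactly 0."""
    shrunk = []
    for v in values:
        magnitude = abs(v) - threshold
        if magnitude <= 0:
            shrunk.append(0.0)
        else:
            shrunk.append(magnitude if v > 0 else -magnitude)
    return shrunk


def sparse_direction(X, y, eta):
    """PLS weight vector X^T y, soft-thresholded at eta * max|w| (0 <= eta < 1)."""
    n, p = len(X), len(X[0])
    w = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    cutoff = eta * max(abs(w_j) for w_j in w)
    return soft_threshold(w, cutoff)


# Hypothetical mean-centred data: 4 samples, 5 wavenumbers.
# Columns 0 and 3 track the composition y; the others are noise-like.
X = [[-1.5, 0.1, -0.1, -3.0, 0.0],
     [-0.5, -0.1, 0.1, -1.0, 0.1],
     [0.5, 0.1, 0.1, 1.0, -0.1],
     [1.5, -0.1, -0.1, 3.0, 0.0]]
y = [-1.5, -0.5, 0.5, 1.5]

w_sparse = sparse_direction(X, y, eta=0.3)
selected = [j for j, w_j in enumerate(w_sparse) if w_j != 0.0]
```

In this toy case only the two columns that co-vary with y retain non-zero weights, so variable selection and estimation of the direction vector happen in one step.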
Transmission Raman spectroscopy (TRS) is a relatively new form of spectroscopy with proven quantitative capability.17,18,29–32 Application areas include pharmaceuticals, security, quality control, etc. In quantitative analysis it is important to optimise the quantitative ability of the assay, with parameters including the limit of detection, limit of quantification and analytical accuracy and precision. All of these parameters are ultimately dependent upon the quality and analytical usefulness of the input dataset(s). To date there have been no systematic attempts to define how best to collect TRS data for quantitative analysis. The current work therefore attempts for the first time to address these considerations, via detailed analysis of a model dataset.
For the ‘continuous window’ analyses, partial least squares (PLS) models were built on sub-sets of the data, each sub-set being 100 cm−1 wide. The PLS models contained a single component as the variable; all data were variance-scaled and mean-centred before analysis and before plotting. A single variable was selected as this is expected to be physically meaningful (i.e. a single mixing parameter describes all compositions). Each data sub-set was offset by 5 cm−1 from its neighbours, so there is a strong degree of overlap between adjacent sub-sets in terms of the wavenumber range covered; this approach is justified by the results obtained. In total, 467 individual PLS analyses were performed (starting wavenumbers for individual sub-sets ranged from 50 to 2380 cm−1). The quality of the fit was quantified by taking the sum of the squared residuals, Σ(actual − calculated)², for the compositions; this is referred to as ‘goodness-of-fit’ throughout. A low value for the goodness-of-fit indicates a good linear quantitative model. The approach is somewhat similar to that taken by Jiang et al.,5 with the exception that no change to the number of variables is made for different windows.
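The moving-window procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the analyses: the function names are hypothetical, the data are assumed to be mean-centred and variance-scaled beforehand, and the toy matrix is made up. The window starts follow from the stated range (50 to 2380 cm−1) and the 5 cm−1 offset.

```python
def window_starts(first=50, last=2380, step=5):
    """Starting wavenumbers (cm^-1) of the overlapping windows."""
    return list(range(first, last + 1, step))


def columns_in_window(wavenumbers, start, width=100):
    """Indices of the data points lying inside [start, start + width]."""
    return [j for j, nu in enumerate(wavenumbers) if start <= nu <= start + width]


def one_component_pls_fit(X, y):
    """Single-component PLS1 fit; X columns and y are assumed pre-centred.

    Returns the fitted compositions and the sum of squared residuals
    (the 'goodness-of-fit' in the sense used in the text).
    """
    n, p = len(X), len(X[0])
    # Weight vector: direction of maximum covariance between X and y.
    w = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    # Scores: projection of each spectrum onto the weight vector.
    t = [sum(X[i][j] * w[j] for j in range(p)) for i in range(n)]
    # Regress y on the single score vector.
    q = sum(t_i * y_i for t_i, y_i in zip(t, y)) / sum(t_i * t_i for t_i in t)
    fitted = [q * t_i for t_i in t]
    ssr = sum((y_i - f_i) ** 2 for y_i, f_i in zip(y, fitted))
    return fitted, ssr


starts = window_starts()

# Made-up, perfectly linear toy window: 3 samples, 2 data points.
X_toy = [[-1.0, -2.0], [0.0, 0.0], [1.0, 2.0]]
y_toy = [-1.0, 0.0, 1.0]
fitted_toy, ssr_toy = one_component_pls_fit(X_toy, y_toy)
```

Here `len(starts)` is 467, matching the number of analyses quoted in the text. In a full analysis, `columns_in_window` would be applied to the measured wavenumber axis for each start value and `one_component_pls_fit` run on the resulting sub-matrix.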
For the ‘non-continuous’ analyses, two approaches were taken. Firstly, the sparse-PLS method26,27 was employed to identify the 100 most important data points for quantification. The sparse-PLS method explicitly considers both the composition and spectral matrix, and selects the wavenumbers which give the best relationship between the two. Secondly, principal component analysis (PCA) was used, with the aim of identifying the spectral regions which exhibit the greatest variance. PCA only considers the spectral data, not the compositions,‡ and was employed to identify the wavenumbers in the spectral data which give the greatest variance in the spectra. The 100 data points with the greatest magnitudes in the loading trace for the first principal component were selected, on the basis that the best model is likely to arise from the data points which provide the greatest variance between spectra.
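The PCA-based selection can be sketched in the same spirit. The block below approximates the first principal-component loading by power iteration and keeps the k data points with the largest absolute loading. It is an illustration under our own assumptions rather than the software used in this work: the function names and toy matrix are hypothetical, the data are taken to be pre-processed already, and k = 2 is used in place of the 100 points selected in the text.

```python
def first_pc_loading(X, iterations=200):
    """Approximate the first PC loading of column-centred X by power
    iteration on X^T X. Only spectral data are used, not compositions."""
    n, p = len(X), len(X[0])
    v = [1.0] * p
    for _ in range(iterations):
        # Scores for the current direction: t = X v.
        t = [sum(X[i][j] * v[j] for j in range(p)) for i in range(n)]
        # Updated direction: v = X^T t, then normalised to unit length.
        v = [sum(X[i][j] * t[i] for i in range(n)) for j in range(p)]
        norm = sum(v_j * v_j for v_j in v) ** 0.5
        v = [v_j / norm for v_j in v]
    return v


def top_k_by_magnitude(loading, k):
    """Indices of the k entries with the largest absolute loading, in ascending order."""
    order = sorted(range(len(loading)), key=lambda j: abs(loading[j]), reverse=True)
    return sorted(order[:k])


# Hypothetical mean-centred spectra: 4 samples, 5 wavenumbers.
X = [[-1.5, 0.1, -0.1, -3.0, 0.0],
     [-0.5, -0.1, 0.1, -1.0, 0.1],
     [0.5, 0.1, 0.1, 1.0, -0.1],
     [1.5, -0.1, -0.1, 3.0, 0.0]]

loading = first_pc_loading(X)
selected = top_k_by_magnitude(loading, k=2)
```

For this toy matrix the two columns carrying most of the variance (indices 0 and 3) are selected. Unlike the sparse-PLS sketch, no composition vector enters the calculation.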
Fig. 1a–f outline the results of applying a continuous 100 cm−1 wide window to a subset and then quantifying the data. The subset of data which provide optimum quantification (of the 467 ranges examined) is highlighted (Fig. 1a–d). It covers the spectral range 85–185 cm−1, which corresponds to low-wavenumber, inter-molecular phonon-mode vibrations. Visual inspection of the spectra in this window (Fig. 1c) indicates that two peaks centred around 120 and 145 cm−1 are the main spectral features. The position of these two peaks varies quite strongly as a function of composition, and it therefore seems reasonable that this window provides a good opportunity for quantification. The PLS model for this region (observed vs. calculated) exhibits an excellent linear trace, with the observed being nearly equal to the calculated at all measured compositions (Fig. 1d).
Of the remaining wavenumber range, the 100 cm−1 windows centred around 550–845, 950–1040 and 1150–1510 also provide good quantification (not shown directly, goodness-of-fit values are indicated in Fig. 1b). Again these spectral regions contain several clear peaks, in this case related to intra-molecular vibrations rather than inter-molecular vibrations, and it seems reasonable that they should therefore provide a good basis for quantification.
The worst sub-region for quantification is 805–905 cm−1, with the analysis results highlighted in Fig. 1a, b, e and f. This region contains several fairly weak peaks, which do appear to change in nature as a function of composition, albeit in what appears to the eye to be a non-uniform manner. Somewhat surprisingly, this spectral region provides a worse set of data for quantification than the region above 1800 cm−1, which contains no peaks whatsoever. Given the poor quality of the fits for the worst sub-region and for the region above 1800 cm−1, there seems little merit in analysing these in excessive detail. It seems likely that the changes above 1800 cm−1 relate to background changes arising from slight differences in the fluorescence from the two polymorphs, which allow some degree of quantification. The reason for the poor quantification of the 805–905 cm−1 region is less easy to rationalise. Trace levels of impurities can probably be ruled out as a cause of the poor quantification: the 27 samples were prepared by simple mixing of two starting materials, so impurity levels would be expected to vary systematically in the same way as the composition. It is possible that a weak surface-related effect, either segregation of the two polymorphs (as noted by Aina et al.24) or re-crystallisation of one of the other polymorphs, may be occurring, or of course that these weak peaks could simply be spurious and arise from some unexpected scattering of light within the system.
The conclusion of Strachan et al., that ‘…selection of spectral regions that contain visually detectable differences between the two forms appears to be the best data selection technique’,19 appears to hold fairly well in our studies; the important wavenumbers highlighted were indeed spectral regions with obvious visual differences between the two polymorphs, as shown by the concurrence of PCA and sparse-PLS results. However, some caution is required in applying the ‘visually detectable difference rule’, in that our worst spectral region for quantification also contained several visually detectable differences between the two forms, but (rather surprisingly) provided worse quantification than a spectral region which contained no peaks at all from either form. We note that the peaks in this ‘worst region’ were extremely weak (around 2–3 orders of magnitude weaker than the most intense peaks). The great benefit of employing a sparse-PLS method is that any subjectivity about which particular peaks to select is avoided. The sparse-PLS approach instead directly identifies data points (wavenumbers) which show the strongest correlation with the composition data.
Footnotes
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c3an01293j
‡ Linking compositions with spectral variance can be performed using principal component regression (PCR). From a practical perspective PCR and PLS are essentially equivalent; PLS was preferred in the current work as it is more commonly used.
This journal is © The Royal Society of Chemistry 2014