Quantification of pharmaceuticals via transmission Raman spectroscopy: data sub-selection

Jonathan C. Burley *a, Adeyinka Aina a, Pavel Matousek b and Christopher Brignell c
aSchool of Pharmacy, University of Nottingham, Boots Science Building, NG7 2RD, UK. E-mail: jonathan.burley@nottingham.ac.uk; Fax: +44 (0) 115 951 5102; Tel: +44 (0) 115 84 68357
bCentral Laser Facility, Research Complex at Harwell, STFC Rutherford Appleton Laboratory, Oxfordshire OX11 0QX, UK
cSchool of Mathematical Sciences, University of Nottingham, NG7 2RD, UK

Received 5th July 2013, Accepted 25th October 2013

First published on 25th October 2013


Abstract

We report the first systematic characterisation of data sub-selection with multivariate analysis to be applied to either transmission Raman spectroscopy (TRS) or the low-wavenumber Raman region. A model pharmaceutical formulation comprising two polymorphs mixed in the range of 1–99% is investigated. For data sub-selection, sparse partial least squares is for the first time applied to TRS data and compared with principal component analysis. It is found that low-wavenumber data (50–340 cm−1) are demonstrably superior to data in the more conventional mid-wavenumber range (340–2000 cm−1) for quantitative modelling. Our results point the way to enhanced quantitative analytical capabilities for TRS, with potential application areas including pharmaceuticals, security and process-analytical technology, by combining data sub-selection with low-wavenumber-capable optics.


1 Introduction

In any spectroscopy experiment a choice is made about which spectral region to employ, and almost invariably a decision must also be made beforehand about which technique or data collection strategy is likely to be most suitable. This choice may be implicit, for example selection of a mid infra-red (mid-IR) spectrometer rather than a near infra-red (NIR) or terahertz (THz) one. The choice of spectral region may also be explicit, for example selection of a particular grating setting and hence a wavenumber range in a Raman mapping experiment, or selection of a single wavenumber in a UV/vis experiment to allow a simple univariate concentration vs. absorbance graph to be plotted, instead of developing a more complex multi-variate model for calibration and analysis.

In other settings there may be a requirement to select only a subset of the data after the experiment has been performed (data reduction). This might be, for example, the selection of one chemically meaningful peak (e.g. carbonyl) for analysis, or it may be to produce as reliable and/or as simple a quantitative model as possible, or to reduce the mathematical and computational complexity of data analysis (e.g. Raman, NIR or mid-IR mapping with large hyper-spectral datasets).

Regardless of whether the decision on which spectral range to use is taken before or after data collection, it is important that the choice is informed and leads to an optimum output of information and creation of new knowledge.

A large body of work exists on the application of post-experiment variable selection, particularly in the field of NIR spectroscopy.1–6 A recent review article covers much of this work.7 The primary drivers for the use of data sub-selection are (i) chemical (e.g. selecting only wavelengths which correspond to the analyte of interest), (ii) physical (e.g. selecting a sub-region for which temperature dependence or humidity does not strongly affect the analysis), (iii) statistical (e.g. reducing the input of wavelengths with more noise than signal) and (iv) other requirements (e.g. simplifying models for translation between instruments, speed of analysis, computational requirements etc.).

In comparison with NIR spectroscopy, which tends to yield fairly broad peaks which are typically not directly linked to a particular chemical group (overtones and combination bands make up the majority of the spectrum8–10), data sub-selection in mid-IR, THz-IR and Raman spectroscopy has received far less attention.1–7,11–15 This emphasis on NIR likely arises due to the frequent requirement in NIR spectroscopy for chemometric methods to understand data. These methods are less common in mid-IR, THz-IR and Raman spectroscopies as the spectra produced are more amenable to direct interpretation.

Two styles of data sub-setting after data collection are reported in the literature. The more common style is the selection of a number of input variables (wavenumber and intensity pairs) which are non-contiguous (e.g. ref. 7 and references therein). The main aim associated with this sub-setting method is typically the reduction of input noise to the model and an associated increase in the accuracy, precision and reliability of the model. The potential disadvantage of the non-contiguous sub-setting is that it is likely to be specific to a particular problem (e.g. sample and/or spectroscopic technique). Less commonly, a continuous ‘spectral window’ can be selected (e.g. ref. 5). This method involves placing restrictions on the choice of wavenumber–intensity pairs and it is therefore likely to be less effective at noise removal but may be more transferable.

The motivation for the current paper is to examine data sub-selection in the context of pharmaceutical quality control, with an overall aim of determining whether sub-selection allows an improvement in quantitative ability. This is an area of major industrial and societal importance. We specifically focus our attention on two emerging aspects of Raman spectroscopy. Firstly, we examine the utility of low-wavenumber data. Traditionally, many Raman spectrometers have only been able to access the wavenumber region above 300 cm−1 unless more specialist equipment was used, access to the lower wavenumber end of the spectrum being limited by the filters required to reject the very intense laser light (at 0 cm−1). In recent years improvements to these filters have allowed easy access to data well below 100 cm−1 in many cases. This low-wavenumber spectral region contains a great deal of information on inter-molecular (rather than intra-molecular) vibrational bands, and can allow for a rapid assessment of crystalline vs. amorphous, salt vs. free base, co-crystal vs. physical mixture, polymorph identification etc. (e.g. ref. 16 and references therein). Secondly, we examine the application of data sub-selection to transmission Raman spectroscopy (TRS). Despite being initially reported17 in 1967, TRS has only become available routinely in the last few years (for a recent review see Buckley and Matousek18). In terms of applications, TRS can penetrate deeply (ca. 50 mm or more) into opaque samples, compared to backscattering Raman spectroscopy which is strongly biased to the near-surface areas (<1 mm or so). To the best of our knowledge this is the first report of data sub-selection applied to either TRS or low-wavenumber Raman data in a systematic way.

A limited amount of work has been reported by other groups centred around the problem of feature extraction and data selection in conventional back-scattering Raman spectroscopy. Madden and Ryder have demonstrated the use of machine learning techniques, with data sub-selection undertaken using neural networks and k-nearest neighbour approaches, for the quantification of cocaine hydrochloride in powder mixtures which also contained anhydrous D-glucose and caffeine.15 McShane and co-workers have reported a ‘peak-hopping stepwise feature selection method’ for sub-selection of spectral regions.14 Their method allowed a reduction in the average percentage error from 65.2%, when all data were used, to 9.4%, when a subset of data was employed, along with a clear improvement in the quantification (see Fig. 5 in their paper). The most relevant report with respect to the current contribution is from Strachan et al.,19 who considered a mixture of two polymorphs of carbamazepine in the range of 1–10% loading of form I in form III, to which principal component analysis and visual sub-selection of data were applied. They concluded that ‘… selection of spectral regions that contain visually detectable differences between the two forms appears to be the best data selection technique’. This is the hypothesis which we investigate in this work, by comparing sparse partial least squares data sub-selection directly with principal component analysis. The former method identifies the data points which provide the best quantitative model, and the latter identifies the data points which exhibit the most change as a function of composition. Whereas Strachan et al. employed conventional back-scattering Raman spectroscopy, we address this issue in the emerging areas of transmission Raman spectroscopy and low-wavenumber Raman scattering, for a full compositional range.

For the present study the samples comprise mixtures of two polymorphic forms of flufenamic acid (FFA) as powders. FFA is a non-steroidal anti-inflammatory drug. In the solid state it can adopt a number of polymorphs,20–23 of which forms I and III are stable and readily prepared through a simple solvent evaporation protocol. The crystal structures of forms I and III are known,22,23 and indicate that these are conformational polymorphs (different molecular conformations are adopted in the two polymorphs). A previous paper by some of the present authors examined the use of the full spectral range in quantification;24 the analyses reported here are of the same data, but with a focus on sub-selection of data.

The mathematical problem posed is to find a linear relationship between the composition and the spectrum for each composition measured at p wavenumbers. A linear model such as y = Xb, where y is a vector of the n composition values and X is the n × p design matrix of spectral values, can be solved uniquely for b using ordinary least squares (OLS) only when p ≤ n. In spectroscopy, however, p is usually several orders of magnitude bigger than n. Partial least squares25 is one dimension reduction technique for solving this ill-posed problem. However, neither OLS nor PLS identifies which predictor variables, in this case wavenumbers, are important, as a large proportion of the variables contribute to the solution.
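The rank argument behind this statement can be written out in standard textbook notation (the symbols follow the paragraph above; the closed form itself is general knowledge rather than something taken from the ESI):

```latex
\hat{\mathbf{b}}_{\mathrm{OLS}}
  = \left(\mathbf{X}^{\mathsf{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y},
\qquad
\operatorname{rank}\!\left(\mathbf{X}^{\mathsf{T}}\mathbf{X}\right) \le \min(n, p)
```

Because the matrix X^T X is p × p, it is singular whenever p > n, so the inverse does not exist and infinitely many vectors b reproduce y equally well. This is the sense in which the spectroscopic regression problem is ill-posed.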

A sparse or parsimonious solution would constrain a subset of the values in b to be exactly zero, thus incorporating variable selection directly into the parameter estimation. Sparse solutions for ordinary least squares regression can be found using a ‘lasso’26,27 using an L1 penalty to shrink some entries in b to zero. A similar method for sparse partial least squares regression has recently been developed,28 thus achieving dimension reduction and variable selection simultaneously. We use this new technique to estimate the linear combination of a subset of wavenumbers which predict the composition with the smallest error.
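To illustrate how an L1 penalty drives coefficients to exactly zero, the following is a minimal sketch of the lasso for ordinary regression, solved by cyclic coordinate descent. It is not the sparse-PLS algorithm of ref. 28 and not the routine used in the ESI; the function names, the toy matrix and the penalty value `lam` are all assumptions made for illustration.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t; anything within [-t, t] becomes exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise 0.5*||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = list(y)  # residual y - Xb; equals y while b is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # inner product of column j with the residual that excludes b[j]
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * b[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                b[j] = new_bj
    return b


# Hypothetical toy problem with more variables (p = 6) than samples (n = 4);
# y is generated from column 0 only.
X = [
    [1.0, 0.2, -0.1, 0.0, 0.1, -0.2],
    [2.0, -0.1, 0.2, 0.1, -0.2, 0.0],
    [3.0, 0.1, 0.0, -0.2, 0.2, 0.1],
    [4.0, -0.2, -0.1, 0.1, 0.0, 0.2],
]
y = [1.0, 2.0, 3.0, 4.0]
b = lasso_cd(X, y, lam=0.5)
selected = [j for j, bj in enumerate(b) if bj != 0.0]  # only column 0 survives
```

The surviving coefficient is slightly shrunk relative to its unpenalised value of 1, which is the usual price paid for the built-in variable selection.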

Transmission Raman spectroscopy (TRS) is a relatively new form of spectroscopy with proven quantitative capability.17,18,29–32 Application areas include pharmaceuticals, security, quality control, etc. In quantitative analysis it is important to optimise the quantitative ability of the assay, with parameters including the limit of detection, limit of quantification and analytical accuracy and precision. All of these parameters are ultimately dependent upon the quality and analytical usefulness of the input dataset(s). To date there have been no systematic attempts to define how best to collect TRS data for quantitative analysis. The current work therefore attempts for the first time to address these considerations, via detailed analysis of a model dataset.

2 Experimental

Percentage compositions refer to the amount of form I in form III by mass (and molar ratio). Full details of the sample preparation and data collection have been published.24 For the analyses reported herein, routines written in the ‘R’ environment33 were employed, in addition to some freely available libraries.34,35 These routines and the raw data are included in the ESI, and an interested reader should therefore be able to reproduce exactly all of the work reported in this paper.

For the ‘continuous window’ analyses, partial least squares (PLS) models were built on sub-sets of the data, each sub-set being 100 cm−1 wide. The PLS models contained a single component (latent variable); all data were variance-scaled and mean-centred before analysis and before plotting. A single variable was selected as this is expected to be physically meaningful (i.e. a single mixing parameter describes all compositions). Each data sub-set was offset by 5 cm−1 from its neighbours, so there is a strong degree of overlap between adjacent sub-sets in terms of the wavenumber range covered; this approach is justified wholly by the results obtained. In total, 467 individual PLS analyses were performed (with starting wavenumbers for individual sub-sets ranging from 50 to 2380 cm−1). The quality of the fit was quantified by taking the sum of the squared residuals, Σ(actual − calculated)², for the compositions; this is referred to as ‘goodness-of-fit’ throughout. A low value for the goodness-of-fit indicates a good linear quantitative model. The approach is somewhat similar to that taken by Jiang et al.,5 with the exception that no change to the number of variables is made for different windows.
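The window scan described above can be sketched as follows. This is an illustrative reimplementation in Python, not the R routines from the ESI: the function names and the toy spectra are hypothetical, and mean-centring (rather than autoscaling) the composition vector is an assumption.

```python
import math


def autoscale(X):
    """Mean-centre each column and scale it to unit sample variance."""
    n, p = len(X), len(X[0])
    out = [[0.0] * p for _ in range(n)]
    for j in range(p):
        col = [row[j] for row in X]
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        if sd > 0.0:  # constant channels carry no information; leave them at 0
            for i in range(n):
                out[i][j] = (col[i] - mean) / sd
    return out


def goodness_of_fit(X, y):
    """Sum of squared residuals of a one-component PLS1 fit of y on X."""
    n = len(y)
    Xs = autoscale(X)
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    # PLS1 weights: covariance of each scaled channel with the composition
    w = [sum(Xs[i][j] * yc[i] for i in range(n)) for j in range(len(Xs[0]))]
    norm = math.sqrt(sum(v * v for v in w))
    if norm < 1e-12:  # no channel covaries with y: the model is just the mean
        return sum(v * v for v in yc)
    w = [v / norm for v in w]
    t = [sum(Xs[i][j] * w[j] for j in range(len(w))) for i in range(n)]  # scores
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return sum((yi - q * ti) ** 2 for yi, ti in zip(yc, t))


def window_scan(wavenumbers, X, y, first=50, last=2380, step=5, width=100):
    """Return [(start, goodness_of_fit)] for each window [start, start + width]."""
    results = []
    for start in range(first, last + 1, step):
        idx = [j for j, wn in enumerate(wavenumbers) if start <= wn <= start + width]
        if idx:
            Xw = [[row[j] for j in idx] for row in X]
            results.append((start, goodness_of_fit(Xw, y)))
    return results


# Window starts of 50, 55, ..., 2380 give the 467 analyses quoted in the text
n_windows = len(range(50, 2380 + 1, 5))

# Hypothetical toy data: 5 mixtures; only the 100-140 band tracks composition
y = [0.0, 0.25, 0.5, 0.75, 1.0]
u = [1.0, -1.0, 0.0, -1.0, 1.0]  # variation uncorrelated with y
wavenumbers = list(range(50, 401, 10))
X = [[2.0 + 3.0 * y[i] if 100 <= wn <= 140 else 1.0 + 0.5 * u[i]
      for wn in wavenumbers] for i in range(len(y))]
scan = dict(window_scan(wavenumbers, X, y, first=50, last=300, step=50))
```

In this toy case the windows that overlap the informative band give a goodness-of-fit of essentially zero, while windows containing only uncorrelated channels fall back to the total sum of squares of the compositions.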

For the ‘non-continuous’ analyses, two approaches were taken. Firstly, the sparse-PLS method26,27 was employed to identify the 100 most important data points for quantification. The sparse-PLS method explicitly considers both the composition and spectral matrix, and selects the wavenumbers which give the best relationship between the two. Secondly, principal component analysis (PCA) was used, with the aim of identifying the spectral regions which exhibit the greatest variance. PCA only considers the spectral data, not the compositions, and was employed to identify the wavenumbers in the spectral data which give the greatest variance in the spectra. The 100 data points with the greatest magnitudes in the loading trace for the first principal component were selected, on the basis that the best model is likely to arise from the data points which provide the greatest variance between spectra.
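The PCA-based selection can be sketched in a few lines. The example below is a hedged illustration, not the ESI code: it finds the first principal-component loadings by power iteration on mean-centred data (whether the published analysis also variance-scaled the spectra before PCA is not assumed here), then keeps the k channels with the largest absolute loadings. All names and the toy matrix are hypothetical.

```python
import math


def first_pc_loadings(X, n_iter=500):
    """First principal-component loadings of X (rows = spectra), power iteration."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    v = [1.0 / math.sqrt(p)] * p  # arbitrary non-zero starting direction
    for _ in range(n_iter):
        t = [sum(r[j] * v[j] for j in range(p)) for r in Xc]  # scores Xc v
        v_new = [sum(Xc[i][j] * t[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(a * a for a in v_new))
        if norm == 0.0:  # no variance at all in the data
            break
        v = [a / norm for a in v_new]
    return v


def top_k(values, k):
    """Indices of the k entries with largest absolute value, in ascending order."""
    order = sorted(range(len(values)), key=lambda j: abs(values[j]), reverse=True)
    return sorted(order[:k])


# Hypothetical toy spectra: channels 1 and 4 change most between samples
y = [0.0, 1.0, 2.0, 3.0]          # composition-like variable
wiggle = [0.3, -0.3, -0.3, 0.3]   # small variation unrelated to composition
X = [[1.0 + 0.01 * c, 5.0 * c, 2.0, 0.02 * c, 10.0 - 4.0 * c, w]
     for c, w in zip(y, wiggle)]
loadings = first_pc_loadings(X)
selected = top_k(loadings, 2)
```

Note that the compositions never enter `first_pc_loadings`; the ranking is driven purely by spectral variance, which is the key contrast with the sparse-PLS selection.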

3 Results

Data sub-setting is applied by using a continuous spectral window, and by selecting a discontinuous sub-set of data points. The results are compared and conclusions drawn. For both continuous and non-continuous sub-sets of wavenumbers, a decision must be made regarding the degree of sub-setting, i.e. the number of data points or the range of wavenumber values which is selected for analysis. The results detailed below cover exemplar values for the number of selected data points (100) and the size of the wavenumber window used (100 cm−1); analyses for some other values are reported in the ESI, and interested readers can of course modify the computer code in the ESI to analyse any arbitrary sub-set of data points if they so wish.

3.1 Selection of a continuous sub-set of data

Unsurprisingly, the use of a carefully chosen sub-set of data allowed for a numerically superior PLS model when compared with an analysis of all data. For analysis of all data the goodness-of-fit is 0.0214; for the optimum 100 cm−1 wide sub-window it falls to 0.00331. The root-mean-square error of prediction (RMSEP) for the composition is 3.00% for all data and 1.18% for the optimum data sub-set, rising to 21.99% for the worst data sub-set.

Fig. 1a–f outline the results of applying a continuous 100 cm−1 wide window to a subset and then quantifying the data. The subset of data which provide optimum quantification (of the 467 ranges examined) is highlighted (Fig. 1a–d). It covers the spectral range 85–185 cm−1, which corresponds to low-wavenumber, inter-molecular phonon-mode vibrations. Visual inspection of the spectra in this window (Fig. 1c) indicates that two peaks centred around 120 and 145 cm−1 are the main spectral features. The position of these two peaks varies quite strongly as a function of composition, and it therefore seems reasonable that this window provides a good opportunity for quantification. The PLS model for this region (observed vs. calculated) exhibits an excellent linear trace, with the observed being nearly equal to the calculated at all measured compositions (Fig. 1d).


Fig. 1 Data sub-selection using a continuous 100 cm−1 wide window, as labelled. Offsets differ between c and e for presentational reasons and are provided as a function of composition. Compositions range from 1% to 99% with increasing offset, i.e. the sample with 1% loading is at the bottom of the plot, the 99% sample is at the top.

Of the remaining wavenumber range, the 100 cm−1 windows centred around 550–845, 950–1040 and 1150–1510 cm−1 also provide good quantification (not shown directly; goodness-of-fit values are indicated in Fig. 1b). Again these spectral regions contain several clear peaks, in this case related to intra-molecular vibrations rather than inter-molecular vibrations, and it seems reasonable that they should therefore provide a good basis for quantification.

The worst sub-region for quantification is 805–905 cm−1, with the analysis results highlighted in Fig. 1a, b, e and f. This region contains several fairly weak peaks, which do appear to change in nature as a function of composition, albeit in what appears to the eye to be a non-uniform manner. Somewhat surprisingly, this spectral region provides a worse set of data for quantification than the region above 1800 cm−1, which contains no peaks whatsoever. Given the poor quality of the fits for the worst sub-region and for the region above 1800 cm−1, there seems little merit in analysing these in excessive detail. It seems likely that the changes above 1800 cm−1 relate to background changes arising from slight differences in the fluorescence from the two polymorphs, which allow some degree of quantification. The reason for the poor quantification of the 805–905 cm−1 region is less easy to rationalise. Trace levels of impurities can probably be ruled out as a cause of the poor quantification: the 27 samples were prepared by simple mixing of two starting materials, so impurity levels would be expected to vary systematically in the same way as the composition. It is possible that a weak surface-related effect, either segregation of the two polymorphs (as noted by Aina et al.24) or re-crystallisation of one of the other polymorphs, may be occurring, or of course that these weak peaks could simply be spurious and arise from some unexpected scattering of light within the system.

3.2 Selection of a non-continuous sub-set of data

The 100 most useful data points (wavenumber values) for quantification as defined by the sparse-PLS method are shown along with the spectra in Fig. 2 (red lines), with the 100 data points exhibiting the most variance between datasets derived from PCA also shown (black lines). The majority of the points lie at low wavenumber: 89 of the 100 data points lie below 340 cm−1, with the remaining 11 points occurring around 785 and 1000 cm−1, where the two most intense intra-molecular Raman peaks appear. A comparison of the sparse-PLS results with the PCA results, which indicate the 100 data points with the greatest variance between samples, shows that 99 of the 100 data points are common to PCA and the sparse-PLS method (the sparse-PLS method includes the data point at 780.1 cm−1, and the PCA method includes the data point at 56.6 cm−1). The utility of low-wavenumber Raman spectroscopy for the analysis of solids has been noted before, both by ourselves and other authors,16,36,37 on qualitative as well as quantitative grounds. It is interesting in the current case that the low-wavenumber data are so dominant. For two polymorphs such as those of FFA we might clearly expect these inter-molecular bands to be important; however, the fact that our samples comprise two conformational polymorphs (with the molecules adopting different conformations in the solid state) might have suggested that mid-wavenumber data (340–2000 cm−1) covering the intra-molecular vibrational bands would be important too. This may be the case, but the key conclusion from this part of our study is that the low-wavenumber inter-molecular bands are more important, regardless of the expected impact of conformational differences.
Fig. 2 Data sub-selection using non-continuous 100 data point selection. Vertical tick marks indicate the wavenumbers of the optimal set of 100 data points for quantification, with sparse-PLS results indicated in red, and PCA in black. Spectral offsets are provided as a function of composition. Compositions range from 1% to 99% with increasing offset, i.e. the sample with 1% loading is at the bottom of the plot, the 99% sample is at the top.
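The comparison between the two selections amounts to simple set arithmetic on the chosen wavenumbers. A minimal sketch follows; the function name and the five-point lists are hypothetical stand-ins rather than the published 100-point selections, and the comparison assumes both methods report wavenumbers on the same measurement grid so that values match exactly.

```python
def compare_selections(a, b, cutoff=340.0):
    """Summarise two sets of selected wavenumbers (cm^-1)."""
    a, b = set(a), set(b)
    return {
        "common": sorted(a & b),       # points chosen by both methods
        "only_a": sorted(a - b),       # points unique to the first method
        "only_b": sorted(b - a),       # points unique to the second method
        "a_below_cutoff": sum(1 for w in a if w < cutoff),
    }


# Hypothetical five-point selections, for illustration only
spls_points = [60.2, 118.4, 145.0, 780.1, 1001.3]
pca_points = [56.6, 60.2, 118.4, 145.0, 1001.3]
summary = compare_selections(spls_points, pca_points)
```

The `cutoff` default of 340 cm−1 mirrors the boundary between the low- and mid-wavenumber regions used in the text.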

4 Discussion

The analyses using both continuous and non-continuous data sub-selection clearly demonstrated the utility of low-wavenumber data in developing optimal quantitative models in pharmaceutical analysis. This was particularly marked for the non-continuous sub-selection, with the vast majority of the important data points occurring below 340 cm−1.

The conclusion of Strachan et al. that ‘… selection of spectral regions that contain visually detectable differences between the two forms appears to be the best data selection technique’19 appears to hold fairly well in our studies; the important wavenumbers highlighted were indeed spectral regions with obvious visual differences between the two polymorphs, as shown by the concurrence of PCA and sparse-PLS results. However, some caution is required in applying the ‘visually detectable difference rule’, in that our worst spectral region for quantification also contained several visually detectable differences between the two forms, but (rather surprisingly) provided worse quantification than a spectral region which contained no peaks at all from either form. We note that the peaks in this ‘worst region’ were extremely weak (around 2–3 orders of magnitude weaker than the most intense peaks). The great benefit of employing a sparse-PLS method is that any subjectivity about which particular peaks to select is avoided. The sparse-PLS approach instead directly identifies data points (wavenumbers) which show the strongest correlation with the composition data.

5 Conclusion

For the first time a detailed analysis of data sub-selection has been undertaken for a set of model transmission Raman data. The main conclusions are: (i) sub-setting of TRS data has the potential to improve the analytical accuracy of TRS; (ii) low-wavenumber data appear to provide the best route to quantification, at least for this model pharmaceutical system, with the results likely to be more generally applicable. The applicability of our findings to non-crystalline systems may, however, be less direct, with amorphous materials exhibiting a broad boson band at low frequency rather than the sharp peaks from the crystalline samples in our study. Areas for likely impact from our work include further instrument development (e.g. uptake of ultra-low filters for laser light rejection with TRS and access to low-wavenumber data), mapping of pharmaceutical samples (in which spectral sub-selection takes place prior to the experiment), and industrial quality control applications.

References

  1. C. H. Spiegelman, M. J. McShane, M. J. Goetz, M. Motamedi, Q. L. Yue and G. L. Coté, Anal. Chem., 1998, 70, 35–44.
  2. P. J. Brown, J. Chemom., 1992, 6, 151–161.
  3. J. M. Brenchley, U. Horchner and J. H. Kalivas, Appl. Spectrosc., 1997, 51, 689–699.
  4. A. S. Bangalore, R. E. Shaffer, G. W. Small and M. A. Arnold, Anal. Chem., 1996, 68, 4200–4212.
  5. J. Jiang, R. J. Berry, H. W. Siesler and Y. Ozaki, Anal. Chem., 2002, 74, 3555–3565.
  6. K. Zheng, Q. Li, J. Wang, J. Geng, P. Cao, T. Sui, X. Wang and Y. Du, Chemom. Intell. Lab. Syst., 2012, 112, 48–54.
  7. Z. Xiaobo, Z. Jiewen, M. J. W. Povey, M. Holmes and M. Hanpin, Anal. Chim. Acta, 2010, 667, 14–32.
  8. L. E. Agelet and C. R. Hurburgh, Crit. Rev. Anal. Chem., 2010, 40, 246–260.
  9. G. W. Small, TrAC, Trends Anal. Chem., 2006, 25, 1057–1066.
  10. G. Reich, Adv. Drug Delivery Rev., 2005, 57, 1109–1143.
  11. R. J. Anderegg and D. J. Pyo, Anal. Chem., 1987, 59, 1914–1917.
  12. U. Horchner and J. H. Kalivas, J. Chemom., 1995, 9, 283–308.
  13. U. Horchner and J. H. Kalivas, Anal. Chim. Acta, 1995, 311, 1–13.
  14. M. J. McShane, B. D. Cameron, G. L. Coté, M. Motamedi and C. H. Spiegelman, Anal. Chim. Acta, 1999, 388, 251–264.
  15. M. G. Madden and A. G. Ryder, Proc. SPIE, 2003, 4876, 1130–1139.
  16. A. Brillante, I. Bilotti, R. G. D. Valle, E. Venuti and A. Girlando, CrystEngComm, 2008, 10, 937–946.
  17. B. Schrader and G. Bergmann, Fresenius' Z. Anal. Chem., 1967, 225, 230–247.
  18. K. Buckley and P. Matousek, Analyst, 2011, 136, 3039–3050.
  19. C. J. Strachan, D. Pratiwi, K. C. Gordon and T. Rades, J. Raman Spectrosc., 2004, 35, 347–352.
  20. J. Krc, Microscope, 1977, 25, 31–34.
  21. E. H. Lee, S. X. M. Boerrigter, A. C. F. Rumondor, S. P. Chamarthy and S. R. Byrn, Cryst. Growth Des., 2008, 8, 91–97.
  22. H. M. Krishna Murthy, T. N. Bhat and M. Vijayan, Acta Crystallogr., Sect. B: Struct. Crystallogr. Cryst. Chem., 1982, 38, 315–317.
  23. J. F. McConnell, Cryst. Struct. Commun., 1973, 3, 459–461.
  24. A. Aina, M. D. Hargreaves, P. Matousek and J. C. Burley, Analyst, 2010, 135, 2328–2333.
  25. S. de Jong, Chemom. Intell. Lab. Syst., 1993, 18, 251–263.
  26. R. Tibshirani, J. Roy. Stat. Soc. B Stat. Meth., 1994, 58, 267–288.
  27. B. Efron, Ann. Math. Stat., 2004, 32, 407–499.
  28. H. Chun and S. Keles, J. Roy. Stat. Soc. B Stat. Meth., 2010, 72, 3–25.
  29. P. Matousek, Appl. Spectrosc., 2007, 61, 845–854.
  30. N. A. Macleod and P. Matousek, Pharm. Res., 2008, 25, 2205–2215.
  31. C. Eliasson, N. A. Macleod, L. C. Jayes, F. C. Clarke, S. V. Hammond, M. R. Smith and P. Matousek, J. Pharm. Biomed. Anal., 2008, 47, 221–229.
  32. P. Matousek and A. W. Parker, Appl. Spectrosc., 2006, 60, 1353–1357.
  33. R Development Core Team, R: A Language and Environment for Statistical Computing, 2012.
  34. D. Chung, H. Chun and S. Keles, spls: Sparse Partial Least Squares (SPLS) Regression and Classification, 2012.
  35. W. Stacklies, H. Redestig, M. Scholz, D. Walther and J. Selbig, Bioinformatics, 2007, 23, 1164–1167.
  36. S. Al-Dulaimi, A. Aina and J. Burley, CrystEngComm, 2010, 12, 1038–1040.
  37. A. Alkhalil, J. B. Nanubolu, C. J. Roberts, J. W. Aylott and J. C. Burley, Cryst. Growth Des., 2011, 11, 422–430.

Footnotes

Electronic supplementary information (ESI) available. See DOI: 10.1039/c3an01293j
Linking compositions with spectral variance can be performed using principal component regression (PCR). From a practical perspective PCR and PLS are essentially equivalent; PLS was preferred in the current work as it is more commonly used.

This journal is © The Royal Society of Chemistry 2014