Open Access Article
Batuhan Yildirim,abc James Doutchb and Jacqueline M. Cole*abc
aCavendish Laboratory, Department of Physics, University of Cambridge, J.J. Thomson Avenue, Cambridge, CB3 0HE, UK. E-mail: jmc61@cam.ac.uk
bISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Didcot, Oxfordshire, OX11 0QX, UK
cResearch Complex at Harwell, Rutherford Appleton Laboratory, Didcot, Oxfordshire, OX11 0FA, UK
First published on 12th March 2024
Machine learning (ML) can be employed at the data-analysis stage of small-angle scattering (SAS) experiments. This could assist in the characterization of nanomaterials and biological samples by providing accurate data-driven predictions of their structural parameters (e.g. particle shape and size) directly from their SAS profiles. However, the unique nature of SAS data presents several challenges to such a goal. For instance, one would need to develop a means of specifying an input representation and ML model that are suitable for processing SAS data. Furthermore, the lack of large open datasets for training such models is a significant barrier. We demonstrate an end-to-end multi-task system for jointly classifying SAS data into scattering-model classes and predicting their parameters. We suggest a scale-invariant representation for SAS intensities that makes the system robust to the units of the input and arbitrary unknown scaling factors, and compare this empirically to two other input representations. To address the lack of available experimental datasets, we create and train our proposed model on 1.1 million theoretical SAS intensities which we make publicly available. These span 55 scattering-model classes with a total of 219 structural parameters. Finally, we discuss applications, limitations and the potential for such a model to be integrated into SAS-data-analysis software.
Machine-learning (ML) models that take as input SAS intensities and estimate scattering-model classes and their parameters have the potential to be integrated into SAS-data analysis software, although there are several challenges that must first be addressed. The lack of large sets of experimental data labeled with scattering-model classes and their parameters means that such models must be trained on synthetic data, which raises questions of generalizability. Additionally, SAS intensities in their raw form are not particularly suitable as inputs to ML models. They can span several orders of magnitude in intensity and have arbitrary scale factors and additive background shifts that make it difficult to attribute a single SAS intensity to a particular scattering-model class and set of parameters. For such a system to be practically useful as a data-analysis tool, uncertainty quantification is likely to be an important feature. In cases where the input experimental data are noisy, or when the data result from a structure with a scattering-model class that is unknown to the ML model, we should expect high uncertainty values. This will allow sensible conclusions to be drawn in the cases where input experimental data are out-of-distribution relative to the training data. Small-angle neutron scattering (SANS) data pose additional challenges as instrumental resolution functions (i.e., smearing parameters), that are unique to individual SANS instruments, can cause variability in data resulting from the same sample measured using different SANS instruments.
We endeavor to address some of these challenges in this work. To this end, we create the SAS-55M-20k dataset, consisting of 1.1 million theoretical 1-D SAS intensities with corresponding discrete scattering-model classes and their continuous parameters.5 This dataset is made publicly available alongside this paper, including the training and test splits, to ease benchmarking of any future models trained on this dataset. Full details on the construction and composition of the dataset can be found in the ESI.† We propose a simple scale-invariant representation of SAS intensities that is suitable for input to ML algorithms, and empirically compare this to two other scale-invariant representations (including one which is commonly used in the literature). Primarily, we present a multi-task transformer neural network that takes as input a 1-D SAS intensity and jointly estimates the scattering-model class that produced it (classification) as well as the continuous parameters of the scattering model (regression), and we evaluate its performance on both tasks. Finally, we discuss some limitations and propose ways in which the data and model may be improved to be more applicable to real 1-D SAS data.
For each of the 55 scattering-model classes, we generated 20,000 idealized SAS intensity functions of dilute (i.e., structure factor S(q) = 1) systems, with parameters sampled randomly in each case. We have ensured that general parameters common to all model classes, including polydispersity and scattering length densities for solvents, as well as model-specific features, were randomly sampled for each instance in our dataset. This effectively ensures that models trained on these data, including ours, are robust to varying ranges of polydispersities and contrasts. In each case, the scattering intensities were generated with fixed volume fractions (i.e., sample concentrations) and background constants (full details on how parameters were sampled can be found in Archibald et al.7). The total number of scattering-model parameters across all 55 classes is 219. Some model classes were excluded because they were computationally infeasible (i.e., too slow) to compute or caused out-of-memory errors. The resulting dataset comprised 1.1 million samples, which we split into training and test data using an 80:20 split. We call this the SAS-55M-20k dataset, and we provide both training and test sets through this publication to facilitate further work on applying ML and other computational methods to 1-D SAS intensities with known parameters. Details on how to download the SAS-55M-20k dataset5 can be found at https://github.com/by256/sasformer.
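As an illustration, such an 80:20 split can be produced deterministically as follows (the seed and helper name are illustrative, not the exact procedure used to construct SAS-55M-20k):

```python
import numpy as np

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Deterministic shuffle-split of sample indices into train and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return idx[n_test:], idx[:n_test]  # (train indices, test indices)

# 1.1 million samples split 80:20, as in SAS-55M-20k.
train_idx, test_idx = train_test_split_indices(1_100_000)
```

Fixing the seed keeps the split reproducible, so any future model trained on the same files sees identical train and test sets.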
A 1-D SAS intensity I(q) is measured as a function of the momentum transfer q = 4πsin(θ)/λ, where θ is the half-angle between the incident beam and a detector, and λ is the wavelength of the incident radiation. Using raw I(q) data as input to neural networks presents some challenges. SAS intensities can span several orders of magnitude in scale, while neural networks typically work best with standardized inputs that have zero mean and unit variance. Moreover, the scale of I(q) depends on arbitrary scaling factors that result from experimental or sample conditions as well as the units used to describe the function. Consequently, although a structural system may be characterized by a single model function specified by a particular set of parameters, various scale factors and units can afford a potentially large number of I(q) functions that could describe the same system. Thus, a scale invariant input representation is required.
To achieve scale invariance, we apply what we call a “quotient transform” to I(q). If I(qn) and I(qn+1) are the scattering intensities at indices n and n + 1 of a discretized I(q) function, the quotient transform is I(qn+1)/I(qn). By applying this transformation, any scale factors (such as those occurring from sample concentrations) are cancelled: if sI(q) is a scattered intensity scaled by an arbitrary scale factor, s, then the quotient transform of sI(q) is sI(qn+1)/sI(qn) = I(qn+1)/I(qn). Formally, the quotient transform of I(q) is expressed as:

| Q[I(q)]n = I(qn+1)/I(qn), n = 1, …, N − 1 | (1) |

In practice, we take the logarithm of the quotient transform to reduce the variability between classes of scattering-model functions that are present in the data. This can be interpreted as a difference transform in log-space, since log(I(qn+1)/I(qn)) = log I(qn+1) − log I(qn). Examples of SAS intensities and their log-quotient transforms are shown in Fig. 2.
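A minimal numerical sketch of the log-quotient transform and its scale invariance follows (the toy intensity curve is illustrative, not a scattering model drawn from the dataset):

```python
import numpy as np

def log_quotient_transform(intensity):
    """log(I(q_{n+1}) / I(q_n)): a difference transform in log-space."""
    intensity = np.asarray(intensity, dtype=float)
    return np.log(intensity[1:] / intensity[:-1])

# Toy monotone SAS-like curve (illustrative only).
q = np.linspace(0.001, 0.5, 256)
I = 1.0 / (1.0 + (50.0 * q) ** 4)

# An arbitrary scale factor s cancels in the quotient.
assert np.allclose(log_quotient_transform(I), log_quotient_transform(7.3 * I))
```

Because the quotient is taken between adjacent points, the transformed sequence is one element shorter than the input.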
Besides enabling inputs to be tokenized, discretizing the input has two other notable benefits. Firstly, by forcing the input values into a set number of bins, small fluctuations in the input are eliminated, increasing the signal-to-noise ratio. This can be particularly useful for experimental SAS data, which tend to be noisy. The second benefit is that discretizing the input makes any subsequent ML model trained on these inputs somewhat robust to background shifts that are commonly present in experimentally obtained SAS data. These additive background constants are often observed to be in the range of 0.001 to 0.01, and the discretizing process neutralizes them since they are so small relative to the range of values spanned by SAS data, particularly at lower q values. In our method, quotient-transformed data are discretized using an ordinal encoding scheme and a quantile-based binning strategy. The ordinal encoding scheme transforms continuous input values into discrete bins represented by integer values, where each bin corresponds to a specific range of the continuous input. The quantile-based binning strategy ensures that an equal number of data points are in each bin. This is particularly useful for handling skewed distributions, which some portions of SAS intensities exhibit when compared across scattering models and across the entire dataset.
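The quantile-based ordinal encoding described above can be sketched as follows (the bin count and helper name are illustrative):

```python
import numpy as np

def quantile_discretize(values, n_bins=16):
    """Ordinal-encode continuous values into quantile bins (~equal counts per bin)."""
    values = np.asarray(values, dtype=float)
    # Bin edges placed at evenly spaced quantiles of the data.
    edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))
    # Each value maps to an integer token in [0, n_bins - 1].
    return np.clip(np.searchsorted(edges[1:-1], values, side="right"), 0, n_bins - 1)

# A skewed distribution still yields roughly uniform bin occupancy.
x = np.random.default_rng(0).lognormal(size=10_000)
tokens = quantile_discretize(x)
counts = np.bincount(tokens, minlength=16)
```

In contrast to uniform-width binning, the quantile edges adapt to the data distribution, so heavily skewed values do not collapse into a few over-populated bins.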
We now provide a high-level overview of transformers and attention in the context of our model's architecture, but we refer the reader to Vaswani et al.24 and Jaegle et al.26 for full details on attention and Perceiver IO, respectively. The SASformer architecture (Fig. 3) consists of (a) an encoder that maps inputs to a space with a smaller number of latent feature vectors for efficient processing; (b) two decoders – one for classifying the scattering-model function and another for predicting the functional parameters of the scattering model. The main component of both the encoder and decoders is the attention mechanism,
| Q = XWQ (case 1: self-attention); Q = ZWQ (case 2: cross-attention); Q = θQ (case 3: cross-attention) | (2) |
| K = XWK; V = XWV | (3) |
| Attention(Q, K, V) = σ(QKT/√dK)V | (4) |

where σ is the softmax function and dK is the feature dimensionality of K. Q, K and V are queries, keys and values, respectively, which can vary depending on whether self-attention or cross-attention is being performed on the inputs (eqn (2) and (3)). The former entails linear projections of the input array by a parameter matrix, hence the name ‘self’-attention, since queries, keys and values are all computed from the same input. The latter involves keys and values that are computed from the input array, while the query is either computed from a distinct and separate input (Z in case 2, eqn (2)) or it can be a learnable parameter matrix (θQ in case 3, eqn (2)); thus it is known as cross-attention. Following Perceiver IO, the SASformer architecture employs self-attention and cross-attention with parameter-only queries at various stages.
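The attention variants in eqn (2)–(4) can be sketched in a few lines of numpy (the dimensionalities are illustrative, not those of the trained model):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, eqn (4): softmax(Q K^T / sqrt(d_K)) V."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
S, T, F = 32, 8, 16                          # sequence, latent and feature dims
X = rng.normal(size=(S, F))                  # tokenized input array
W_Q, W_K, W_V = (rng.normal(size=(F, F)) for _ in range(3))

# Case 1, self-attention: Q, K and V are all projections of the same input X.
self_out = attention(X @ W_Q, X @ W_K, X @ W_V)

# Case 3, cross-attention with a learnable parameter query theta_Q, as in the
# Perceiver IO encoder: S input tokens are mapped to T < S latent vectors.
theta_Q = rng.normal(size=(T, F))
latent = attention(theta_Q, X @ W_K, X @ W_V)
```

The parameter-query case is what lets the encoder shrink the sequence dimension from S to T, since the output length is set by the query rather than by the input.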
Fig. 3 Data pre-processing steps and SASformer model overview. The input is a quotient-transformed and discretized I(q), which is tokenized and position encoded, resulting in a matrix X ∈ ℝS×F, where S and F are the sequence and feature dimensionalities, respectively. A cross-attention layer then transforms this to a latent representation Z ∈ ℝT×F with a smaller sequence dimension (T < S) for efficient processing by N subsequent self-attention blocks. The latent representation is then passed through two decoders (each a single cross-attention block): one outputs a probability distribution over all scattering-model classes m and the other outputs a P-dimensional vector of continuous parameters for all scattering-model classes.

Using an efficient variant of the transformer24 architecture called Perceiver IO,25,26 we developed a neural network to characterize small-angle scattering data into a scattering-model class and its parameters. We call our model SASformer.
Averaged over all scattering-model classes, accuracy and top-3 accuracy are 0.95674 and 0.99257, respectively. If such a model is incorporated into SAS-data-analysis software, the exceptional top-3 accuracy suggests that proposing the 3 most probable scattering models is likely to be more informative to a user, since the top-3 classification predictions almost always contain the true scattering-model class. Inter-scattering-model classification results are presented in Table 1, where the values of the metrics make it clear that some scattering models are more difficult to predict than others. We observed that most of the errors result from misclassifying a scattering model as a different but similar model from the same family – a group of related scattering-model classes. For example, when the input has a ground-truth class label of sphere, the model occasionally misclassifies it as belonging to either the core–shell sphere, ellipsoid, fractal (fractal-like aggregates of spheres) or fuzzy sphere classes. This is unsurprising, as these classes are all from the sphere family and hence represent similar scattering systems in which the scatterers are slightly different forms of spheres. The analytic formulae of the scattering functions of these classes all share very similar form-factor components and, as a result, produce scattering intensities with similar shapes and features that can be hard for a neural network (and even a human expert) to distinguish from each other. The same is observed for scattering models of the parallelepiped family (parallelepiped, rectangular prism, hollow rectangular prism thin walls), cylinder family (cylinder, core–shell cylinder, elliptical cylinder, flexible cylinder, hollow cylinder), etc.
| Scattering model | Acc ↑ | Top-3 Acc ↑ | Scattering model | Acc ↑ | Top-3 Acc ↑ |
|---|---|---|---|---|---|
| Adsorbed layer | 1.0 | 1.0 | Lamellar stack paracrys. | 0.994 | 1.0 |
| Binary hard sphere | 0.997 | 0.998 | Linear pearls | 1.0 | 1.0 |
| Broad peak | 1.0 | 1.0 | Lorentz | 1.0 | 1.0 |
| Core multi shell | 0.934 | 0.991 | Mass fractal | 1.0 | 1.0 |
| Core shell bicelle | 0.837 | 0.956 | Mass surface fractal | 1.0 | 1.0 |
| Core shell cylinder | 0.821 | 0.967 | Mono Gauss coil | 0.994 | 1.0 |
| Core shell ellipsoid | 0.888 | 0.969 | Multilayer vesicle | 0.987 | 0.996 |
| Core shell sphere | 0.777 | 0.977 | Onion | 0.894 | 0.968 |
| Correlation length | 1.0 | 1.0 | Parallelepiped | 0.888 | 0.98 |
| Cylinder | 0.906 | 0.984 | Peak Lorentz | 1.0 | 1.0 |
| Dab | 1.0 | 1.0 | Pearl necklace | 0.998 | 0.999 |
| Ellipsoid | 0.897 | 0.977 | Poly Gauss coil | 0.963 | 1.0 |
| Elliptical cylinder | 0.932 | 0.988 | Polymer excl. volume | 0.994 | 1.0 |
| Flexible cylinder | 0.966 | 0.984 | Porod | 1.0 | 1.0 |
| Flexible cylinder elliptical | 0.998 | 1.0 | Power law | 0.998 | 1.0 |
| Fractal | 0.938 | 0.993 | Raspberry | 0.994 | 0.999 |
| Fractal core shell | 0.879 | 0.938 | Rectangular prism | 0.79 | 0.982 |
| Fuzzy sphere | 0.896 | 0.999 | Sphere | 0.88 | 0.998 |
| Gauss Lorentz gel | 0.976 | 1.0 | Stacked disks | 0.964 | 0.992 |
| Gaussian peak | 0.997 | 1.0 | Star polymer | 1.0 | 1.0 |
| Gel fit | 0.978 | 0.999 | Surface fractal | 1.0 | 1.0 |
| Guinier | 1.0 | 1.0 | Teubner strey | 0.994 | 1.0 |
| Hollow cylinder | 0.911 | 0.969 | Triaxial ellipsoid | 0.917 | 0.991 |
| Hollow rect. prism thin walls | 0.957 | 0.998 | Two Lorentzian | 0.997 | 1.0 |
| Lamellar | 1.0 | 1.0 | Two power law | 0.976 | 1.0 |
| Lamellar hg | 0.988 | 1.0 | Unified power rg | 0.997 | 1.0 |
| Lamellar hg stack caille | 0.948 | 1.0 | Vesicle | 0.996 | 1.0 |
| Lamellar stack caille | 0.982 | 1.0 | — | — | — |
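For reference, the top-3 accuracy metric reported above can be computed as follows (a minimal sketch, not the evaluation code used in this work):

```python
import numpy as np

def top_k_accuracy(logits, labels, k=3):
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    logits = np.asarray(logits)
    topk = np.argsort(logits, axis=1)[:, -k:]  # indices of the k largest scores
    return np.mean([labels[i] in topk[i] for i in range(len(labels))])

# Toy 4-class example: sample 0 is correct at rank 1, sample 1 only at rank 2.
logits = np.array([[0.1, 0.2, 0.6, 0.1],
                   [0.5, 0.3, 0.1, 0.1]])
labels = np.array([2, 1])
acc1 = top_k_accuracy(logits, labels, k=1)  # -> 0.5
acc3 = top_k_accuracy(logits, labels, k=3)  # -> 1.0
```

With k = 1 this reduces to ordinary accuracy, which is why top-3 accuracy always upper-bounds it.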
Fig. 5 Visual comparison of I(q) generated from SASformer's parameter predictions to the true I(q) for several binary hard sphere, flexible cylinder, lamellar stack caille, multilayer vesicle and rectangular prism scattering models. See Fig. 4 for more details.
Quantitatively, the performance of SASformer in predicting the scattering-model parameters was assessed using the mean absolute error (MAE), mean absolute percentage error (MAPE) and coefficient of determination (R2), calculated on samples in the test set. In each case, MAE is presented in the units of the parameter as stated in the SasView Model Functions documentation. Given that the ranges of the parameters vary substantially, MAE cannot be used to compare the performance of SASformer on different parameters. MAPE is a unitless quantity that enables this inter-parameter comparison. Full descriptions and definitions of these metrics are presented in the ESI.† The standard deviation and interquartile range (IQR) of the absolute errors are also reported to show the variability in the predictive distributions of each scattering-model parameter. Compared to the standard deviation, the IQR provides a better description of the spread when data are not normally distributed.
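These metrics can be computed for a single scattering-model parameter as follows (a minimal sketch using the standard definitions; the helper name is illustrative):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MAE, MAPE, R2 and IQR of absolute errors for one scattering-model parameter."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = np.abs(y_true - y_pred)
    mape = np.mean(err / np.abs(y_true))  # assumes y_true is never zero
    r2 = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    q75, q25 = np.percentile(err, [75, 25])
    return {"MAE": err.mean(), "MAPE": mape, "R2": r2, "IQR": q75 - q25}

m = regression_metrics([10.0, 20.0, 30.0, 40.0], [11.0, 19.0, 33.0, 38.0])
```

Note that MAE and IQR carry the parameter's units, whereas MAPE and R2 are unitless, which is why only the latter two are used for inter-parameter comparison.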
Tables S5–S8 in the ESI† show the quantitative regression metrics, stratified by scattering-model class, for each parameter. To assess SASformer's ability to accurately predict each scattering-model parameter, we regard parameters with a MAPE less than 0.25 (i.e., 25%) and R2 greater than 0.6 as those that SASformer predicts reasonably well. Of the 219 scattering-model parameters in the SAS-55M-20k dataset, SASformer meets these criteria on 100 parameters. Using stricter cutoffs of less than 0.1 for MAPE and greater than 0.9 for R2, 59 scattering-model parameters meet the criteria. From these results, we conclude that although SASformer performs well on a reasonably sized subset of parameters, there is room for improvement. Limited performance on the remaining parameters may be due to a variety of factors. For instance, the relatively lower performance observed for multi-shell models, such as core multi-shell or onion, can likely be attributed to their inherently increased complexity, which arises not only from the substantial number of parameters within these models but also from the intricate interdependencies among these parameters. While SASformer may encounter difficulty in predicting some of the available parameters, it is not designed as a substitute for the least-squares fitting method used to determine these parameters in practical applications. Instead, SASformer is intended to aid the fitting process by proposing parameter ranges to test, potentially making it a valuable tool despite its limitations.
• Quotient transform, as described in the methods section of this work, where we take the log of the quotient transform of the square of I(q).
• Scalar neutralization, which is the cumulative product of the quotient transform of the square of I(q). A logarithm follows the cumulative product.
• Zero-index normalization, where we first square then take the logarithm of I(q) and divide the entire resulting sequence by its zeroth index value.
While the quotient transform changes the shape of the input scattering intensity, scalar neutralization restores its original shape by application of the cumulative product, removing any scale factors in the process. Zero-index normalization, which also preserves the original function's shape, is arguably the simplest transformation, and was used by Molodenskiy et al.8 and Archibald et al.7 in their work. We omit a comparison to dimensionally-reduced representations that do not make use of the full sequence, such as those used in the works of Franke et al.,6 Lutz-Bueno et al.19 and others cited in Section 1.1. Our approach utilizes a transformer-based neural network, which capitalizes on the strengths of such models in processing and learning from features across the entire signal sequence.
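The three representations can be sketched as follows (the toy intensity curve is illustrative; the asserts show the scale factor cancelling exactly in the first two, and the cumulative product telescoping back to the original shape):

```python
import numpy as np

def quotient_transform(I):
    """Log of the quotient transform of I(q)^2."""
    return np.log(I[1:] ** 2 / I[:-1] ** 2)

def scalar_neutralization(I):
    """Log of the cumulative product of the quotient transform of I(q)^2."""
    return np.log(np.cumprod(I[1:] ** 2 / I[:-1] ** 2))

def zero_index_normalization(I):
    """log(I(q)^2) divided by its zeroth-index value."""
    log_I2 = np.log(I ** 2)
    return log_I2 / log_I2[0]

q = np.linspace(0.01, 0.5, 128)
I = np.exp(-((30.0 * q) ** 2) / 3.0)  # toy Guinier-like curve (illustrative)

# The scale factor s cancels in the quotient, so the first two are invariant.
s = 4.2
assert np.allclose(quotient_transform(I), quotient_transform(s * I))
assert np.allclose(scalar_neutralization(I), scalar_neutralization(s * I))

# The cumulative product telescopes, restoring the original shape up to I(q_0)^2.
assert np.allclose(scalar_neutralization(I), np.log(I[1:] ** 2 / I[0] ** 2))
```

The telescoping identity makes the shape-restoring property of scalar neutralization explicit: the transformed sequence equals the log-intensity curve shifted by its first value.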
For each model trained using the three scale-invariant representations and evaluated on the test dataset, Table 2 shows (a) the average of the accuracies of each scattering-model class; (b) the median of the MAPEs of scattering-model parameters (MdMAPE). The quotient transform results in the highest accuracy and lowest MdMAPE, while both the quotient transform and scalar neutralization methods significantly outperform zero-index normalization on accuracy. Zero-index normalization significantly outperforms scalar neutralization in terms of MdMAPE.
| | Acc ↑ | MdMAPE ↓ |
|---|---|---|
| Quotient transform | 0.957 | 0.297 |
| Scalar neutralization | 0.934 | 0.365 |
| Zero-index norm | 0.891 | 0.326 |
All three input transformations studied in this work are scale invariant, which provides a solution to the aforementioned problems of units and arbitrary scalars in real SAS intensities. However, it is worth highlighting that there may be cases where some quantities that we would like to predict depend on the scale of I(q). In these cases, scale-invariant transformations like the quotient transform are insufficient, since scale information is removed completely from the input. This suggests the possibility that a lack of scale in the input may be responsible for the inability of our model to predict some of the scattering-model parameters in Tables S5–S8† with any accuracy at all (as shown by the large MAPE and near-zero R2 values for some targets). This is plausible, but we leave its investigation for future work, as it requires the development of either an entirely new input representation or an extension of the model to additionally take scale information as input (alongside a scale-invariant representation of I(q)). Meanwhile, the results of this work are compelling, as evidenced by the good fits observed on many of the scattering-model classes.
As previously mentioned, a ML model such as ours could be integrated into SAS-data-analysis software. When experimental SAS data are loaded into the software, the data could be pre-processed and passed as input to the SASformer model, which provides a prediction of the top three scattering models that most probably represent the data and their parameters. Additionally, a software library with a simple application programming interface (API) that provides a pre-trained version of SASformer could allow users with large numbers of SAS-data files to obtain predictions of structural information in a high-throughput manner. This would be particularly useful for analyzing a huge amount of data that would otherwise be too arduous to analyze manually.
While our method stands to work well for SAXS data, performance on SANS data is likely to be slightly worse in comparison. This is due to q-resolution smearing, which occurs because of the unique geometries of SANS instruments. Consequently, data collected on a single sample using different SANS instruments would result in different scattering intensities, in which the sharpness of features varies, with peaks and fringes being broadened. During training of our model, batches of SAS intensities are sampled randomly in each training step. These could be smeared by convolving each I(q) with random instrument smearing parameters at each step. This would make the model robust to data from different SANS instruments and would avoid the need to create intractably large datasets that cover the configurations of all SANS instruments. Our method was trained on SAS data without noise, and it is likely that performance on noisy data would be slightly worse as a result. This could be addressed in a similar manner to the aforementioned SANS-data issue, by adding noise to each SAS intensity as batches are sampled during training, making the model resilient to noisy inputs. Additionally, since the data in the SAS-55M-20k dataset do not contain structure factors, it is unclear how the model would perform when faced with a SAS intensity function that has a structure-factor component. In the future, the scope of our method could be extended to enable the prediction of scattering intensities of multi-component systems, such as a system containing spherical scatterers with proportion p and cylindrical scatterers with proportion 1 − p, thus making it more general. This would additionally enable the prediction of structure-factor models and their parameters, and inter-particle distance-distribution information could be obtained as a result.
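The proposed training-time augmentation could be sketched as follows, with a Gaussian kernel standing in for a real instrument resolution function (the kernel-width range, noise level and helper name are illustrative assumptions):

```python
import numpy as np

def augment(I, rng, sigma_range=(0.5, 3.0), noise_level=0.02):
    """Randomly smear and perturb one I(q) as it is sampled into a training batch."""
    # Smearing: convolve with a Gaussian kernel of random width (in channels),
    # standing in for an instrument-specific resolution function.
    sigma = rng.uniform(*sigma_range)
    half = int(np.ceil(3.0 * sigma))
    x = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    smeared = np.convolve(I, kernel, mode="same")
    # Noise: multiplicative perturbation crudely mimicking counting statistics.
    return smeared * (1.0 + noise_level * rng.normal(size=I.shape))

rng = np.random.default_rng(0)
q = np.linspace(0.01, 0.5, 256)
I = 1.0 / (1.0 + (40.0 * q) ** 4)
I_aug = augment(I, rng)
```

Because a fresh kernel width and noise realization are drawn per sample, the model never sees the same idealized curve twice, which is the mechanism by which such augmentation substitutes for an intractably large pre-smeared dataset.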
Finally, uncertainty quantification could be enabled by defining the model as a Bayesian neural network,31 which would allow a distribution of predictions to be constructed from multiple stochastic outputs of the model. Alternatively, conformal prediction methods,32,33 which can be applied to any underlying point predictor given the assumption of data exchangeability, could be employed to produce prediction regions or intervals and provide a more nuanced understanding of the uncertainties associated with the predictions.
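As an illustration of the conformal route, a split conformal interval for a single regression target can be constructed from held-out calibration residuals (the residuals here are synthetic; in practice they would be |y_true − y_pred| from a calibration set):

```python
import numpy as np

def split_conformal_half_width(cal_residuals, alpha=0.1):
    """Interval half-width giving ~(1 - alpha) marginal coverage (split conformal)."""
    n = len(cal_residuals)
    # Finite-sample-corrected quantile of the calibration residuals.
    k = int(np.ceil((n + 1) * (1.0 - alpha)))
    return np.sort(cal_residuals)[min(k, n) - 1]

# Calibration residuals of a hypothetical point predictor (synthetic here).
rng = np.random.default_rng(0)
cal_residuals = np.abs(rng.normal(scale=2.0, size=1000))
w = split_conformal_half_width(cal_residuals, alpha=0.1)

# A new point prediction y_hat is then reported as the interval [y_hat - w, y_hat + w].
y_hat = 5.0
interval = (y_hat - w, y_hat + w)
```

The appeal of this scheme is that it wraps any point predictor, including a trained SASformer, without retraining; the coverage guarantee rests only on exchangeability of calibration and test data.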
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3dd00225j
This journal is © The Royal Society of Chemistry 2024