Bayesian inference of protein ensembles from SAXS data

The inherent flexibility of intrinsically disordered proteins (IDPs) and multi-domain proteins with intrinsically disordered regions (IDRs) presents challenges to structural analysis. These macromolecules need to be represented by an ensemble of conformations, rather than a single structure. Small-angle X-ray scattering (SAXS) experiments capture ensemble-averaged data for the set of conformations. We present a Bayesian approach to ensemble inference from SAXS data, called Bayesian ensemble SAXS (BE-SAXS). We address two issues with existing methods: the use of a finite ensemble of structures to represent the underlying distribution, and the selection of that ensemble as a subset of an initial pool of structures. This is achieved through the formulation of a Bayesian posterior of the conformational space. BE-SAXS modifies a structural prior distribution in accordance with the experimental data. It uses multi-step expectation maximization, with alternating rounds of Markov-chain Monte Carlo simulation and empirical Bayes optimization. We demonstrate the method by employing it to obtain a conformational ensemble of the antitoxin PaaA2 and comparing the results to a published ensemble.


Introduction
Recent years have witnessed increased recognition of the ubiquity and importance of intrinsically disordered proteins (IDPs) and multi-domain proteins with disordered intra-domain linker regions (IDRs). 1-5 Long unstructured regions can be found in more than half of eukaryotic proteins and at least 25% are completely disordered. 6 It is becoming evident that structural plasticity plays an important role in the function of biological macromolecules, e.g. in areas such as transcription regulation, cell signaling, and the function of chaperones. 1,7,8 Misfolding and aggregation of IDPs are associated with many human diseases, such as Alzheimer's and Parkinson's. 9,10 These flexible proteins comprise dynamic systems that explore a conformational space that cannot be adequately described by a single state, but requires an ensemble of conformations.
Small-angle X-ray scattering (SAXS) and nuclear magnetic resonance (NMR), as solution structure methods, are wellsuited to characterize structural ensembles. SAXS, in particular, is a powerful technique, yielding averaged, low-resolution structural information across multiple spatial orders of magnitude. Combined with appropriate ensemble-based computational methodology, it could allow for the characterization of IDP and IDR flexibility not accessible through NMR spectroscopy or X-ray crystallography alone. 11,12 Current computational methods aim to recover a representative ensemble as a subset of conformations from a large pool of candidate structures, based on experimental SAXS data. [11][12][13][14] The initial pool of structures is generated from either knowledge-or physics-based models. A common assumption in these approaches is that the structural ensemble can be represented accurately by a weighted average of discrete conformations. Small sets of conformers are typically used as an approximation, 15 in order to avoid overfitting and to reduce the computational load. The Ensemble Optimization Method (EOM) uses a genetic algorithm with a predefined number of structures of equal weight for ensemble selection, 16 while the improved EOM 2.0 optimizes individual weights together with an ensemble size within a customizable range. 12 Minimal Ensemble Search (MES) uses a genetic algorithm on a population of ensembles of sizes between 2 and 5 structures. 17 In the Basis-Set Supported SAXS (BSS-SAXS) approach, conformations are assigned to a small number of clusters, first by RMSD and then by scattering pattern similarity, after which a Bayesian MC algorithm is used to determine the cluster weights. 18 The Ensemble Refinement of SAXS (EROS) method similarly uses RMSD clustering followed by maximum entropy 19 cluster weight optimization. 20 In the program ENSEMBLE, a predetermined number of conformations is employed, with either equal or varied weights, and the ensemble is optimized using axial descent or simulated annealing algorithms. [21][22][23][24] The Sparse Ensemble Selection (SES) method reformulates the ensemble selection problem as a linear leastsquares problem that optimizes the weights of all structures in the initial pool, yielding a sparse ensemble of conformations. 25 Many of these approaches limit the ensemble size explicitly while others, e.g. BSS-SAXS and SES, use sparsity-inducing algorithms. However, in flexible systems, such as IDPs and IDRs, a small number of conformations may not adequately explain the data. 25 In contrast, a number of methodologies that have been applied to NMR data eschew reweighing of structures in favor of probabilistic sampling according to the maximum entropy principle. 15,[26][27][28][29][30][31][32] In this manner, an ensemble-based description is obtained that balances the experimental data with prior information, typically encoded in a force field.
Here, we approach SAXS data in a similar manner, resulting in a new method for inference of structural ensembles, called Bayesian Ensemble SAXS (BE-SAXS). BE-SAXS combines a generative, fine-grained (i.e. atomic-level) model of protein structure with experimental SAXS data. Through an iterative expectation maximization (EM) algorithm the method adapts a prior distribution concerning protein structure in atomic detail to match the SAXS ensemble average, within the experimental uncertainty. The resulting posterior distribution takes the ensemble nature of the data into account and correctly balances information present in both the force field and the experimental data. The number of model parameters depends only on the number of experimental observables and representative structures can be sampled a posteriori. Furthermore, since conformations are not restricted to a subset of an initial pool of structures, bias attributable to the initial selection process and limited sampling is avoided.
We apply the BE-SAXS method to SAXS data for the flexible antitoxin PaaA2 and show substantial agreement between the recovered distribution of conformations and the published structural ensemble of the protein. These results illustrate the utility of the method in elucidating the flexibility of partially-or fully-disordered proteins.

Inferential structural ensemble determination
In probabilistic inferential structure determination (ISD) the goal is to establish a posterior distribution p(x|d,r 2 ) of protein conformations x, given some experimental data d with experimental errors r 2 . 33 The classic ISD approach assumes that the experimental data represent a single conformation. Consequently, application of the method to disordered systems, which are characterized by highly heterogeneous ensembles, may give misleading results. 27 Such flexible systems require an ensemble-based inference method.
SAXS experiments measure the temporal (i.e. over the measurement duration) and ensemble average of the X-ray scattering from all orientations and conformations of the proteins in a solution. Therefore, d is a noisy observation of the true ensemble average e of the scattering f for each individual conformation of a protein. f is a lower-dimensional projection, or coarse-grained representation, of the fine-grained variable x, through a deterministic function, f h(x). A model for such ensemble-averaged data was previously expressed as a Bayesian network and applied in the context of NMR data. 27,28 It gives rise to the following posterior distribution over the coarse-grained variables: This coarse-grained probabilistic model is then combined with the prior distribution of the fine-grained variable x, according to an appropriate probabilistic prior model M, using the reference ratio method (RRM). 34 The RRM is based on the principles of probability kinematics, a variant of Bayesian updating that can be used to modify a given probability distribution in the light of new evidence regarding partitions of the distribution's sample space. 35 The updated posterior is: This combined posterior is the distribution with minimum Kullback-Leibler divergence from the fine-grained prior p(x|M), under the requirement that the marginal distribution of the coarse-grained variables follows eqn (1). 36

SAXS ensembles
In the case of SAXS, the experimental data d and the ensemble average e constitute vectors of scattering intensities, while the structures x are represented as vectors of atomic coordinates. A force field or a fragment library could be used to sample from the prior distribution p(x|M); here, we use the PROFASI force field. 37 A coarse-grained vector f is generated through a forward model by approximating the scattering function h(x) with the Debye formula, which holds for spherical scatterers: 38 where q = (4p sin y)/l is the momentum transfer, with scattering angle 2y and wavelength of the X-ray beam l. F i (q) is the atomic form factor for atom i, r ij is the distance between atoms i and j, and K is the number of atoms in the structure. The X-ray scattering factors are calculated using a linear combination of Gaussians fit to empirical data. 39 Posterior distribution. We use a Gaussian distribution for the likelihood, p(d|e,r 2 ), to relate the data to the ensemble average e. For the ratio of the two unknown distributions p(f|e) and p(f|M) in eqn (2) we use a log-linear model G fje; B ð Þwith a link function l(B,e) = Be À1 , 40 where B is a diagonal matrix and Z is a normalization constant. The matrix B serves to match the first moment, hfi, of the coarse-grained prior represented by the PROFASI force field to the ensemble average e. This model is scale-invariant when f and e are scaled together, i.e. G fje; B ð Þ¼G cfjce; B ð Þfor any constant c. This is required due to the arbitrary scale of SAXS data.
Assuming a uniform prior for e, the joint posterior distribution from eqn (2) for SAXS ensembles becomes: In the last term, E prof is the energy of the PROFASI force field and b 1/kT, where T is the temperature and k is the Boltzmann constant.
Determining B. We modify the EM algorithm described by Olsson et al., 28 to estimate the matrix B (Fig. 1). This corresponds to adopting an empirical Bayes strategy for the prior distribution of the ensemble posterior.
In the E-stage of iteration k of the algorithm, a Markov chain Monte Carlo (MCMC) simulation, as implemented in the PHAISTOS framework, 41 produces N samples S ðkÞ ¼ f 1::N ; e 1::N ; x 1::N f g from the posterior p(e,f,x|d,r 2 ,B (k) ). The result is a conformational ensemble of structures together with their forward-computed SAXS profiles, whose average optimally matches the experimental data. The iterative algorithm is initialized with the zero matrix, B (0) = 0, resulting in an unrestrained simulation with the structural prior, exp(ÀbE prof (x)).
A new scaling matrix B (k+1) is estimated in the M-stage, by minimizing a w 2 EM objective function: with: Conceptually, the M-stage aims to ensure that a given ensemble average e and the matching coarse-grained average of the sampled structures hfi coincide. It is necessary to normalize by the experimental errors in eqn (7), since SAXS data ranges over several orders of magnitude across the scattering profile. The role of the second term is to use Tikhonov regularization to avoid overfitting. 42 Here, it is utilized specifically to avoid excessive changes to the matrix B due to finite sampling issues, allowing for monotonous convergence of the parameters.
The expectation of the coarse-grained variable, , is estimated from the N samples using importance sampling: 43 It is notable that the importance weights in eqn (8) do not change when f and e are scaled together. In practice, both the coarse-grained vector f and the ensemble average e are brought to scale with the experimental data d -the former through a scaling coefficient determined at initialization, and the latter through the Gaussian ensemble likelihood. Therefore, the matrix B (k+1) and the associated structural ensemble produced by the algorithm remain invariant, regardless of the absolute magnitude of d.
The expectation of the ensemble average is approximated by the sample average: For further details see the work of Olsson et al. 28 We use the basin hopping stochastic global optimization algorithm 44 for the minimization of the objective function in eqn (6); however, other optimization techniques such as genetic algorithms or parallel tempering may be utilized. In principle, because the function is convex, gradient descent algorithms are also applicable but we found that they can be unstable due to finite statistical sampling. Convergence can be considered achieved once the objective function falls below 0.5, indicating incremental improvements within the experimental uncertainty of the data.

Simulations
Experimental data. We utilized the published conformational ensemble of the disordered protein PaaA2 in order to test the BE-SAXS ensemble method. 45 PaaA2 is an antitoxin that is encoded by a toxin-antitoxin module in Escherichia coli O157. 46 In the absence of its binding partner, the toxin ParE2, PaaA2 behaves like an IDP. However, it contains two stable a-helical regions that are flanked by highly disordered stretches of amino acids. 45 The published structural ensemble of PaaA2 consists of 50 conformations and is available from the PDB database under the code 3ZBE. The structures were selected by the application of a jackknife procedure to EOM-derived SAXS ensembles from a pool of NMR-restrained conformers. 45 Following the Reference Ensemble Method, 47 in order to validate the BE-SAXS algorithm we used a SAXS forward model to create a synthetic data set from the reference ensemble of 50 conformations. This allows controlling for all sources of uncertainty in the evaluation. We constructed the SAXS ensemble average data d for the protein by generating SAXS profiles d i for each conformation, using the FoXS program, 48 and averaging the individual profiles: Experimental errors r 2 were assigned as the population variance of the data.
Computation. The EM algorithm ran for a total of 21 iterations. In each E-stage, the PHAISTOS framework was used to run 64 independent MCMC chains for 10 6 steps. 41 Samples S ðkÞ were saved every 10 3 steps to be used in the M-stage, after a 40% burn-in. The global optimization algorithm of the M-stage was run for up to 20 independent iterations, or until a stable solution was found. The algorithm reached convergence at iteration 10, as judged from the change in fit between EM steps, w 2 EM , from the ensemble SAXS profile fit, w 2 SAXS , and from the magnitude of the changes in the scaling matrix B. The measure of fit to the experimental data was defined as: where hfi is the ensemble average: The generative probabilistic models TorusDBN and BASILISK were used as proposal distributions during the MCMC simulation for main chain and side chain moves, respectively. 49,50 The introduced bias was subsequently removed. The PROFASI force field at T = 300 K was used as the prior distribution of the structures x. 37 GPU calculations. The forward calculation of the SAXS profile is the most compute-intensive part of the BE-SAXS ensemble method. We used our GPU Parallel Page-Tile SAXS algorithm with atomic form factors to accelerate the computation of eqn (3). 51, 52 We utilized a 16-core Intel Xeon E5-2660 server with 2 NVIDIA GeForce GTX 690 GPU cards (4x1536 GPU cores), which allowed us to run the 64 MCMC chains in parallel.
To accelerate the M-stage, we implemented an OpenCL kernel that calculates eqn (8) on the GPU. 53 The efficiency of this approach depends on the number of samples used; for this simulation, the GPU acceleration reduced the stage time by a factor of 3.
Ensembles. The structural ensembles for each EM iteration (EM i , for i = 0,. . .,20) were generated by uniformly sampling conformations from the 64 independent MCMC chains at 10 4 MC-step intervals, after a 40% burn-in. This resulted in 3904 structures per iteration. 128 structures were sampled uniformly from EM 0 and EM 9 in order to visualize the ensembles.

Algorithm convergence for PaaA2
In the E-stage of the first iteration of the BE-SAXS algorithm, the conformational ensemble EM 0 of the protein PaaA2 was effectively sampled from an unrestrained PROFASI force field. The resulting ensemble average does not fit the SAXS scattering profile well, as evidenced by the high value of the w 2 SAXS measure (Fig. 2). This suggests that PROFASI alone, as a minimalistic force field, does not accurately capture the details of the flexibility of PaaA2 represented in the calculated ensembleaveraged SAXS data. In subsequent iterations, however, the fit improves rapidly and reaches a stable region. The objective function, w 2 EM , also reaches a low value quickly and falls below 0.5 in iteration 9 (Fig. 2). At this level, by the nature of w 2 EM , modifications to the matrix B produce changes in the importance sampling approximating distribution that are within the experimental uncertainty of the data. The individual coefficients of B also stabilize at iteration 9, further indicating convergence. The equilibrium reached thereby is dynamic, due to the stochastic nature of the basin hopping global optimization algorithm used in the M-stage, combined with the underdetermined optimization problem in eqn (6).
Convergence in the BE-SAXS algorithm has to be evaluated comprehensively, by examination of both w 2 EM and w 2 SAXS , since a low w 2 EM does not guarantee that the conformational ensemble provides a good fit to the data. If there is an insufficient number of steps in the E-stage to allow for the MCMC to reach equilibrium, then the Boltzmann distribution will not be sampled successfully. Thus, a low w 2 EM could be achieved at a specific iteration and still result in a B matrix that does not produce an ensemble average matching the experimental data. Furthermore, it is necessary to examine the behavior of the w 2 statistics and the B coefficients over a range of EM iterations, to determine if an equilibrium has in fact been reached. Because the optimization problem in eqn (6) is underdetermined, fluctuations in both the matrix B and w 2 SAXS are expected. However, in order to assume convergence, these fluctuations should be confined to a stable and relatively narrow region.

BE-SAXS restrains the PaaA2 ensemble
We examined and compared the EM 0 and EM 9 structural ensembles of the protein PaaA2, in order to evaluate the performance of the BE-SAXS method. The scattering average for the initial, unrestrained ensemble EM 0 exhibits a poor fit to the SAXS profile, d, (w 2 SAXS = 65.0) while the average for the restrained ensemble EM 9 shows good agreement with the data (w 2 SAXS = 0.9), within the margins of error (Fig. 3). The high q range of the SAXS profile contains atomic-level data and the larger deviation observed there could be due to the stronger influence of the PROFASI force field on the local structure of the simulated IDP protein than on the overall shape. While the deviation is within the error bounds, it may be desirable to further penalize discrepancies within this range during the M-stage optimization. Alternatively, better sampling of the local structure could be achieved by a longer simulation that emphasizes local and side chain moves. This may allow for a more accurate assessment of the agreement between the ensemble averages of the target and approximating distributions in the M-stage.
To further characterize the EM 0 and EM 9 ensembles, we compared their radius of gyration (R g ) distributions to the R g distribution of the published PaaA2 reference ensemble (Fig. 4). The 50-structure 3ZBE ensemble is relatively compact, while the unrestrained PROFASI-driven EM 0 exhibits a wider variation of R g with two prominent modes. On the other hand, the SAXSrestrained EM 9 closely matches the original ensemble in both its mean and sample error, suggesting that BE-SAXS is able to extract ensemble-level R g information from the SAXS profile.
Due to the low information content of SAXS data, it is not possible to summarize the ensemble using only a few representative conformations, despite the presence of a force field. However, the scattering profile can inform about the general shape of the protein. Taking advantage of the stable a-helical regions in PaaA2, we defined a shape descriptor, K sh , as a proxy to the 3-dimentional shape. The K sh measure is calculated as the ratio of the distances between the distal and proximal ends of the two helices (the Ca atoms of residue pairs (16,57) and (28,42), respectively); thus K sh is an indicator of the ''openness'' of the overall structure. We compared the distributions of the descriptor for the EM 0 , EM 9 , and reference ensembles (Fig. 5). The unrestrained EM 0 gives rise to a bimodal distribution for K sh and favors open structures. The shape descriptor distributions for the reference ensemble and the SAXS-restrained EM 9 show substantial similarity to each other, and share a propensity for more compact structures.   45 (black) and the ensembles at EM iterations 0 (blue) and 9 (red). The distribution for 3ZBE was derived through kernel density estimation, due to the limited number of conformations. Fig. 5 Comparison of the distributions of the shape descriptor, K sh , for the 3ZBE ensemble reported by Sterckx et al. 45 (black) and the ensembles at EM iterations 0 (blue) and 9 (red). The distribution for 3ZBE was derived through kernel density estimation, due to the limited number of conformations.
The ability of the BE-SAXS method to restrict the solution space to areas consistent with the experimental data is further evident in the visualized ensembles (Fig. 6). EM 9 exhibits characteristics similar to the reference ensemble -it favors conformations in which the two a-helices are packed closely together, while maintaining significant overall flexibility. At the same time, the unrestrained EM 0 comprises structures that are consistent with uniform rotation around the disordered linker. The linker flexibility is greater in EM 9 than in EM 0 , with more diversity in the relative orientations of the two helices, as in the original ensemble.
The peripheral disordered regions in both EM 0 and EM 9 exhibit much more helical structure than the 3ZBE ensemble. This is likely the effect of the PROFASI force field on local structure and it helps explain the larger deviation of the scattering profile at high q values. The main advantage of PROFASI is efficiency, but a more sophisticated force field would presumably produce a better fit with the data.

Conclusions
A novel method for inference of protein ensembles from SAXS data, which we call Bayesian Ensemble SAXS, was described and demonstrated here as a proof of principle. BE-SAXS proceeds through successive expectation maximization steps and uses a Bayesian probabilistic model for ensemble-averaged SAXS data to modify a probabilistic model of protein structure, in agreement with an experimental scattering profile. This results in a generative model that can be used directly to characterize a protein's conformational ensemble, or that can be further restrained with other types of experimental data, such as NMR. The generative approach offers a particular advantage for flexible systems, such as intrinsically disordered proteins and proteins with long disordered regions, since it does not impose restrictions on the ensemble size and allows sampling of the full conformational space allowed by the data. The number of parameters of the generative probabilistic model only depends on the number of experimental observables, and not on the size of the ensemble. This stands in contrast to many existing SAXS ensemble methods that fit a set of structures to the data and where each replica results in a linear increase in the number of parameters.
To illustrate the BE-SAXS method, we applied it to the ensembleaveraged SAXS data for the published conformational ensemble of the highly flexible antitoxin PaaA2. We showed that our approach restrains the conformational space accessible to the protein simulation and yields ensembles with characteristics consistent with the original set of structures. The ability of the method to model protein flexibility suggests its utility in characterizing other IDPs and multidomain proteins. The Bayesian probabilistic formulation used here can be complemented by other probabilistic models based on experimental observables. In particular, NMR residual dipolar couplings (RDCs) and chemical shifts are commonly utilized in the context of disordered proteins. 29,54 We expect that employing BE-SAXS in concert with methods that make use of other experimental data, can greatly help elucidate the native state ensembles of flexible macromolecular systems.   9 , the SAXS-restrained ensemble at iteration 9 of BE-SAXS. All structures are aligned on the first helix (colored in cyan). The color of the second helix corresponds to the R g of the structure in Å (indicated in the color bar).