Testing the Transferability of a Coarse-grained Model to Intrinsically Disordered Proteins

ae The intermediate-resolution coarse-grained protein model PLUM [T. is used to simulate small systems of intrinsically disordered proteins involved in biomineralisation. With minor adjustments to reduce bias toward stable secondary structure, the model generates conformational ensembles conforming to structural predictions from atomistic simulation. Without additional structural information as input, the model distinguishes regions of the chain by predicted degree of disorder, manifestation of structure, and involvement in chain dimerisation. The model is also able to distinguish dimerisation behaviour between one intrinsically disordered peptide and a closely related mutant. We contrast this against the poor ability of PLUM to model the S1 quartz-binding peptide.


Intrinsically disordered proteins
Intrinsically disordered proteins and peptides (IDPs) are amino acid sequences which lack a static and stable native structure, and therefore differ from classical ordered proteins. IDPs are abundant in nature and often derive function from their disorder. 1 IDPs are defined by their inability to fold into a unique and stable tertiary structure, and this term is preferable to the early term IUPs (intrinsically unstructured proteins) which falsely suggests a complete lack of structure. 2 Despite the presence of intrinsic disorder in 15-45% of eukaryotic proteins, 3 the study of IDPs is a nascent field. The classical structure-function paradigm of proteins emerged from the proposition that protein denaturation is purely a conformational change, 4 and states that 3D protein structure determines its function, therefore, all functional proteins require a stable native state. The cataloguing of thousands of functional native protein structures during the rest of the 20th century 5 solidified the notion of 3D structure being a prerequisite of function.
Although evidence of conformational adaptability 6,7 and functional disordered regions (numerous examples cited in ref. 8) appeared during the second half of the 20th century, it took until the turn of the millennium for researchers to formally argue for function in natively disordered proteins, [9][10][11][12] thereby casting doubt on the universality of the structure-function paradigm.
Most popular coarse-grained protein models include a bias towards a native structure by including terms describing interactions between backbone atoms, as in 'Gō-like' models, 23 or by fixing the secondary structure as in MARTINI and others. [24][25][26] These approaches render a model inadequate for studying conformational changes of secondary structure. By the same token, this is not a suitable coarse-graining approach for IDPs. Conversely, models which give an almost all-atom description of the backbone by modelling each of C a , C 0 and N explicitly can aim to sample secondary structure organically, thus offering a better hope of understanding IDPs. [27][28][29][30][31] PLUM 27,32 is one such model, and the present work examines its ability to realistically simulate a class of IDPs involved in biomineralisation.

Biomineralisation and the n16N peptide
Biomineralisation is the process of mineral formation in controlled environments by living organisms. Biomineralisation can occur in an extracellular environment, where it is regulated by a macromolecular complex 33-35 comprising proteins, polysaccharides or glycoproteins, 36 with high levels of disorder. Indeed, biomineralisation proteins have been called the most disordered functional class in the protein world. 37 The function of molecular assembly, required of complex-forming proteins, may benefit greatly from disorder. Complexes may assemble in multiple stages, making use of conformational adaptability to refold and to overcome steric barriers. Additionally, IDPs could exhibit selectivity over the chemical environment in which assembly is possible. 38 n16 is a family of 108AA (amino acid chain length) ''aragonite promoter'' proteins, 39,40 named after their presence in nacre and their molecular weight in kDA. 23 polymorphic variants have been identified, all actively expressed in pearl oyster ( pinctada fucata), 41 while homologues of n16 have been found in other molluscs. [42][43][44] The 30AA N-terminal sequence of n16 shown in Fig. 1, named n16N, has been studied experimentally in some detail [45][46][47][48][49][50][51][52][53] and has been called ''the key self-assembly/aragonite forming domain''. 53 The addition of a b-chitin substrate, an extremely abundant polymer with a rigid crystalline structure which is present in the in vivo system, 54 has been shown to enhance n16 and n16N's biomineralisation activity. 49,51,55 An atomistic simulation study of n16N under aqueous conditions, using the Replica Exchange with Solute Tempering (REST) 56,57 approach, provides a basis for comparison with work carried out with coarse-grained models. 58 The ability of the REST approach to access observables that are otherwise challenging to obtain from molecular simulation was recently demonstrated with the near-reproduction of the experimentally determined binding free energy of the AuBP1 peptide adsorbed at the aqueous Au interface. 59 n16N subdomains with the capacity to perform different functions in the context of the three-component n16N/b-chitin/ calcium carbonate system have been proposed. 58,60 These are described in Table 1 and shown in Fig. 1. SD1 and SD2 are rich in tyrosine residues, which are hypothesised to have roles in intra-and inter-peptide stabilisation via ring-ring and hydrogen bond interactions, and lead to SD1 and SD2 being less flexible than SD3. SD3 is a highly charged region which may be the mineral assembly subdomain; the mechanism of ion capture, by such a flexible, charged polymer chain, has been termed ''fly-casting''. 61 A mutant of n16N referred to as n16NN is produced by mutating the acidic residues with their charge-neutral counterparts (Asp -Asn, Glu -Gln). This greatly hampers the aragonite selectivity of the peptide. 49,51 It is confirmed that Asp and Glu do have an active role in organic-mineral association 48 and that these substitutions abolish n16N's ability to form complexes with Ca 2+ . 47 n16NN has been shown to self-assemble in an aberrant manner 47 or not at all, 50 suggesting that a simulation without mineral ions may suffice to reveal significant differences between n16N and this mutant.
In the present work, systems of n16N and n16NN will be simulated in one-unit and two-unit systems in order to evaluate PLUM's applicability to this IDP. As we cannot expect models at this resolution to provide new insight into the specific chemical nature of inter-peptide stabilisation, we limit our analysis to the ability of PLUM's effective interactions to correctly capture secondary structure statistics.
1.3 The S1 peptide and the proline residue S1 is a bioinformatics-designed 12AA peptide, named as a contraction of 'strong-binder one', after its place as the first peptide designed by its authors to bind strongly to quartz. 63 Circular dichroism (CD) spectral analysis of the peptide is indicative of a significant degree of polyproline II character, and this result was replicated by atomistic replica exchange molecular dynamics (REMD) simulation of the peptide in solution. 64 The sequence of S1 is PPPWLPYMPPWS.
Proline (P) is a unique residue. Its side-chain is bonded cyclically to both the C a and N backbone atoms, in a 5-membered ring. This makes the side-chain's structural properties unique, limiting the f dihedral angle to approximately À601, removing the preference for trans-isomerisation, and preventing the residue's nitrogen from participating in hydrogen bonding.
The PLUM model features a proline residue which cannot hydrogen bond through its nitrogen and has its own o dihedral angular potential which is bimodal, facilitating both trans-and cis-isomerisation. However, there are no further provisions. In particular, the polyproline II structure is stabilised by P-P side-chain interactions, while the PLUM model is developed for the more common case of steric inhibition and hydrogen bonding being the main drivers of secondary structure. 27 We include S1 within our testing to assess the limitations of the PLUM model in this context.

Methods
The implicit-solvent PLUM model, introduced by Bereau and Deserno, 27 is designed to model protein folding and aggregation. Its goal is to be predictive where the protein structure is ''not known, not well defined, strongly perturbed from the native state, or adjusts during aggregation events''. 27 The PLUM model represents the backbone with near-atomistic resolution, with beads for NH, C a H and C 0 O united atoms. NH and C 0 O beads can hydrogen bond through a directional potential which depends on implicit positions of hydrogen and oxygen atoms within these beads. The side-chains are represented by single beads interacting via a simple hydrophobic scale, which may also capture other effects, based on previous crystal structure work. 65 Simulations of this model were carried out in the LAMMPS simulation package 66 (http://lammps.sandia.gov), modified to support PLUM-specific pair potentials. All interaction parameters  Table 1. Cationic amino acid residues shown in bold blue, anionic residues shown in bold red. Intrapeptide stabilisation. SD2 9 to 16 Clustering role due to interpeptide Y-Y interactions. 60 b-Sheet forming capacity. 62 Intrapeptide stabilisation. SD3 17 to 30 Greatest conformational accessibility, highly charged; proposed ''fly-casting'' mechanism in ion capture. 61 were taken from Bereau and Deserno's original PLUM paper. 27 A timestep size of 3 fs was used and a Langevin thermostat was applied with damping parameter 1000 fs. We stress that the unit of time in PLUM and other coarse-grained models is not well-defined. The notional time units used here are calculated in terms of the energy, mass and length scales defined by the model. However these do not map directly onto real time and should be considered a measure of sample size only.
In order to extensively sample the ensemble of conformations available to each peptide, we employ the enhanced sampling technique replica exchange molecular dynamics (REMD). 67,68 REMD simulations have M non-interacting systems with the same Hamiltonian, in heat baths at a range of temperatures. The highest temperatures have the greatest ability to overcome potential energy barriers. By regularly proposing swaps of coordinates and scaling of momenta between replicas, and accepting these proposals according to the Metropolis prescription, the enhanced sampling at the highest temperatures can propagate to the lower-temperature ensembles without violating canonical ensemble statistics. 69 In the current work, we verify sufficient sampling of the whole temperature range by measuring the average round trip time across the range and ensuring the simulation time is far greater.
We use a clustering analysis to group geometrically similar conformations, producing a list of clusters ranked by their frequency of occurrence over the course of a trajectory. The g_cluster tool, available as part of the Gromacs package, [70][71][72] and the gromos clustering algorithm, 73 were used. Structures are candidates for grouping if they fall within a specified cut-off limit measuring the root-mean-square deviation (RMSD) between their atomic positions. The RMSD cut-off is a chosen parameter whose best value depends on the goal of the analysis and the system being studied, with larger systems generally requiring a larger cut-off to group similar manifestations of structure. Full chains of n16N or n16NN were analysed with a cut-off of 0.4 nm. Two-chain systems used a cut-off of 0.6 nm. Region-wise cluster analyses were carried out on the chain for residues 1 to 8, 9 to 16 and 23 to 30 with an RMSD cut-off of 0.2 nm. These residue ranges were deliberately equal-length representations of SD1, SD2 and SD3, defined in Table 1. 'Core' regions of the two-chain system, used in Section 3.4, had a cut-off of 0.3 nm, and consisted of the regions representing SD1 and SD2. In all cases, backbone atom positions were used for analysis, without sidechain beads.
Cartoon-style images of peptide conformations were produced in the NGL viewer 74 for the single-chain system and the VMD viewer 75 for the two-chain systems. As standard automatic structureidentifying algorithms cannot be applied to PLUM model proteins, the SABBAC 76 online tool was used for the monomer, and manual assignment based on backbone dihedral angles and hydrogen bonding was used for the dimers. The data collected at 300.0 K were analysed. Based on comparison to available atomistic data, 58 the simulation showed marked over-stabilisation of the a-helix secondary structural motif over the entire length of the chain. Analysing 11 400 trajectory snapshots for geometric likeness, according to the algorithm described in Section 2, showed that the structure reproduced in Fig. 2 was remarkably dominant, with a cluster population of 48.9% of the snapshots. Fig. 3 is a Ramachandran plot which reinforces the notion of a-type structures dominating the peptide's behaviour in PLUM.

Overstabilisation of the a-helix in the original PLUM model
These data suggested that the original PLUM model is too strongly biased toward stable secondary structure to accurately reproduce the conformational ensemble of this IDP. The a-helix, and other common motifs, are principally stabilised by the strong energetic favourability of hydrogen bonding. Therefore, we have repeated the above simulations with minor alterations to the strength of the PLUM backbone-backbone hydrogen bond interaction strength parameter e HB . The resulting occupancy of each quadrant of the Ramachandran plot was plotted and compared to the atomistic REST trajectory, captured at 300.0 K in CHARMM22*. 58 These data are presented in Fig. 4. Fig. 4(b) shows that a decline in preference for a-helix structure occurs when the e HB parameter value is reduced, tracking the decline in lower left quadrant occupancy shown in Fig. 4(a). The PLUM output is very sensitive to adjustments, and reaches peak similarity to the atomistic data with a decrease of about 5%. However, when the level of structure is stratified by subdomain, it emerges that the PLUM model does not match the atomistic model in regional ranking by a-helicity.
Based on this study, we have adopted an e HB value set to 94.5% of the original; this will be henceforth referred to as the PLUM* model. Before moving forward, the PLUM* model was tested against validation systems used by the original PLUM authors to check for unexpected changes. The de novo designed peptide 2A3D 80 and the 15-unit GNNQQNY peptide system 81 were observed to fold and aggregate normally, the only change being the expected decline of approximately 5% in the transition temperature to disorder. At present, we regard this modification as specific to the n16N system. The tendency to exaggerate stability of secondary structure motifs in IDPs may be a general feature of PLUM-like models, however the sensitivity to e HB demonstrated in Fig. 4 suggests that the specific adjustment used here is unlikely to be transferable. Similar calibration against atomistic simulation may be required on a system by system basis.

n16N in PLUM*
The simulation protocol of Section 3.1 was repeated in the PLUM* model for the n16N peptide. The results at 300.0 K revealed that the retuning was sufficient to bring PLUM into alignment with CHARMM22* on multiple measures of structural properties.
An identical clustering analysis led to the top structure given in Fig. 2 falling in popularity from 48.9% to 4.4%. 1593 clusters of geometrically similar structures arose, compared to the previous experiment's count of 453, and the distribution was far flatter, with top four percentages of 4.4%, 3.4%, 3.0% and 2.4% compared to 48.9%, 10.3%, 4.1% and 3.7%. This implies that n16N is far more disordered in PLUM* than PLUM.
Region-wise cluster analyses were carried out on the chain for equal-length representations of each subdomain, defined in Table 1. All regions produced 32 clusters. The SD1 segment's top three clusters are populated with 41.3%, 26.2% and 9.5% of trajectory snapshots, SD2's with 48.1%, 20.2% and 9.2%, and SD3's with 33.3%, 20.7% and 15.7%. This disparity implies that SD3 possesses the greatest conformational accessibility, in agreement with the fly-casting hypothesis and with previous atomistic results. 58 Fig . 5 shows the Ramachandran heat map for this simulation, broadly showing greater disorder in secondary structure. Fig. 6(a) and (b) break down the comparison of secondary structure by specific named structural regions. The majority of segments match well in (a) and (b), but PLUM* has greater g-structure and other structure, at the expense of PPII structure.
In Fig. 7, secondary structure is compared to the atomistic data on a per-residue basis. The results are promising, indicating that the PLUM* model has a good ability to select between the primary options; a-structure or b-like structure, at the level of individual peptide bonds. Strikingly, both top left quadrant lines are punctuated by two valleys centred on glycines, fluctuating about 0.6 otherwise. Each a line hits a minimum around SD2's I residue, but the disagreement in a-helicity between each subdomain is clear.
Moving away from secondary structure, further support for the subdomain hypothesis and for PLUM*'s ability to distinguish chain regions is presented in Fig. 8. Here, side-chain interaction frequencies are examined for the single chain, and the result bolsters the subdomain hypothesis, showing an island of SD1-SD2 interactions, while SD3 is isolated.

n16NN in PLUM*
The simulation parameters used for n16N were used again on the mutant peptide n16NN, and the 300.0 K trajectory was analysed.
The top geometric clusters were similar to those found for n16N; the structure in Fig. 2 remained the most frequentlyoccurring with an increased population of 8.3%. The increased Fig. 3 Ramachandran plot showing the exploration of (f,c) space for a single unit of n16N at 300.0 K. The dominant peak represents a-helix structure, while the secondary peak represents a L -helix structure, the first peak's enantiomer.  58 with the CHARMM22* model 77,78 in TIP3P water. 79 The occupancy of the alpha-helix-dominated bottom left quadrant drops from approximately 60% to 30% when e HB ; is reduced to 94.5% of its original value. stability is an expected result of removing negatively-charged residues. Fig. 6(b) shows that a move towards greater stability of a-structure, at the expense of most other structural forms, is representative of the whole ensemble. This points towards the conclusion that the charged residues had a role in ensuring the peptide could thermally access a relatively large number of conformational states. However, region-wise cluster analyses show no clear trend towards greater local conformational accessibility, suggesting that the difference lies in the characteristics of the whole peptide. SD1's top clusters had populations at 40.6%, 27.8% and 8.1% of trajectory snapshots, SD2's at 45.9%, 20.0% and 15.9%, and SD3's at 28.1%, 19.0% and 18.0%.

Two units of n16N and n16NN
REMD simulations of two-chain systems of n16N and of n16NN were carried out; these will be denoted n16N-2 and n16NN-2. n16N-2 simulations ran for 5.1 ms in PLUM* and PLUM, and the n16NN-2 simulation in PLUM* ran for 6.4 ms.  Fig. 9 shows the Ramachandran heat maps for the dimer systems. In the PLUM* model, secondary structure manifestation has changed greatly between n16N and n16N-2. aand a Lstructure has been replaced by b-structure, as an emergent result of the peptide's multiplicity. Fig. 9(b) shows that the original PLUM model has not allowed new behaviour to emerge from n16N, compared to the monomer case in Fig. 3. Its structural ensemble has remained a-helix dominated. The n16NN-2 chain in PLUM* shows a greater remaining propensity for a-helicity than n16N-2, though it too has shifted towards b-structure.    Trajectory snapshots were again used for geometric clustering. The PLUM* trajectories showed a trend of subdomains SD1 and SD2 retaining rigid, b-based conformations at the core of the dipeptide system, while SD3 is extended and labile. n16N-2's highest-ranked cluster populations are 2.6%, 1.9%, 1.0% and 1.0%, and two of these are shown in Fig. 10. n16NN-2 is more stable, and its most populous cluster at 6.9% is the structure shown in Fig. 10(b). The second-place structure with a population of 2.7% is in Fig. 10(a). PLUM produced top n16N clusters mostly consisting of two chains folded helically as in Fig. 2, sitting next to each other and interacting through their side-chains.
In order to check for differences in the stable core regions of n16N and n16NN, SD1 and SD2, between the n16N-2 and n16NN-2 PLUM* chains, these regions were also clustered separately. This resulted in the two top clusters matching those in Fig. 10. The most frequently-occurring clusters of n16NN-2 contain a larger proportion of the overall population than in the n16N-2 system, its top three scoring 3.5%, 3.2% and 1.3%, compared to 2.6%, 0.79% and 0.63%. This complements the result seen for single units of the peptides that conformational accessibility of the full chain, not just that of SD3, drops as a result of the changes from n16N to n16NN. Fig. 11 shows each residue's level of involvement in interpeptide interactions for the dimers in the PLUM* model. Combined with the clustering analysis, these data are in striking agreement with the hypothesised domain roles (Table 1). SD1 and SD2 are highly involved in interpeptide stabilisation, both by backbone and side-chain interaction. Interpeptide interactions decline after SD2, so that the tail of SD3 is largely free and unbound. A surprisingly simple difference is seen between the n16N and n16NN lines, which is a slight increase in proportion throughout, once again lending strength to the hypothesis of the SD3 changes having a full-system effect.

The S1 peptide in PLUM*
A REMD simulation of the proline-rich peptide S1 was carried out with 16 replicas, each running for 8. The trajectory at 300.0 K was analysed. Fig. 12 displays the Ramachandran heat map for this dataset. As discussed in Section 1.3, S1 is expected to fold into a polyproline II helix, with approximate (f, c) coordinates of (À751, 1601), making the result seen here disappointing. Extremely similar results were produced for S1 when the simulation was performed in the original PLUM model. S1 was simulated as a litmus test of the PLUM model's ability to model proline realistically, and its failure leads to an important caveat about PLUM's success.

Summary and conclusions
The four-bead-per-residue protein model PLUM 27 was used to simulate two peptides representative of intrinsically disordered proteins (IDP). The model places side-chain beads on a hydrophobic scale and deduces interaction strengths from simple mixing rules. Simulations of n16N in the PLUM model showed Fig. 9 Ramachandran plots of (a) n16N-2 in the PLUM* model, (b) n16N-2 in the PLUM model, and (c) n16NN-2 in the PLUM* model. Despite the relatively subtle differences between the original and altered PLUM model for single-unit n16N, the two models diverge upon simulating n16N-2. Fig. 10 The top-occurring structures for the n16N-2 system in PLUM* at 300.0 K, with populations of (a) 2.6% and (b) 1.9%. In the top structure, SD1 and SD2 form b-hairpins, turning on residues G7 and R8, and ending with a turn on I14 and P15. In the second structure, parallel b-strands bind the chains, with few intrapeptide interactions. Turns occur within the region of residues K5 to R8, and more sharply at residues I14 and P15. In both structures, only the final seven C-terminal residues form a free tail. The two peptides are distinguished by colour. over-expression of the a-helix motif, replicating the common result that IDP simulations with current models produce structures which are overly collapsed. [18][19][20][21][22] The model's backbone hydrogen bonding strength parameter was adjusted with reference to atomistic data; 58 an optimal reduction was found to be 5.5%. We refer to this adjusted model as PLUM*.
The new model had greatly enhanced success with ensemble averaged structural properties of the n16N system. Measures of secondary structure based on dihedral angles reveal a good approximation to atomistic data, especially in terms of gross structural characteristics, and semi-quantitatively on a per-peptide bond basis. Measurements of most frequent residue-residue interactions showed that the importance of tyrosine-tyrosine interactions and the subdomain SD2's interactions, proposed on the basis of bioinformatics studies 60 and atomistic simulations 58 were replicated in PLUM*. A clustering analysis of the trajectory showed that the C-terminal region known as SD3 had the greatest conformational accessibility, in keeping with the current subdomain hypothesis shown in Table 1.
The n16N-2 and n16NN-2 system simulations were extremely interesting. The fact that multiplicity of the peptide in the system vastly changes the peptides' folding and draws divergent behaviour out of each subdomain, aligning with the hypothesised aggregationdependent function and featuring disorder, is a remarkable property of the system for Bereau and Deserno's simple model to capture. This result suggests the strong possibility of a role for coarse-grained protein models of this level, albeit in modified form, in studying intrinsically disordered protein behaviour.
The simulations of the mutant peptide n16NN showed that PLUM* can distinguish small changes in a peptide; these systems had greater full-chain stability for the more frequently occurring geometric cluster, and had a stronger tendency than n16N for a-structure. Compared to n16N-2, the n16NN-2 system featured a far more stable SD1 and SD2 core; an interesting result, as these subdomains are identical in the two peptides. As it has been suggested that disorder is beneficial for molecular assembly, 38 this may be relevant to n16NN's reported difficulty aggregating. 47,50 Experimental CD spectral data and dynamic light scattering data 52 and bioinformatic studies 62 provide an insight into the oligomerisation behaviour of n16N which larger-scale PLUM* simulations could compare to. Greater structural stability is expected from n16N in larger systems and PLUMms behaviour in this situation will be an important test of the model. Failure of PLUM* to correctly simulate the proline-rich S1 peptide highlights one instance where caution is needed, as the proline residue is found at the centre of the n16N sequence. Fig. 11 The proportion of trajectory snapshots for which any given residue along the chain is involved in an interaction binding it to the other chain, in the n16N-2 and n16NN-2 systems. The n16N residue sequence is shown on the x-axis; the residues indicated in red are replaced in n16NN according to D -N and E -Q. Interactions are divided into side-chain and backbone hydrogen bond types; glycine is skipped on the side-chain datasets as it has no side-chain bead. Fig. 12 Ramachandran plot showing the exploration of (f,c) space for a single unit of S1 at 300.0 K.