Terminal repeats impact collagen triple-helix stability through hydrogen bonding

Nearly 30% of human proteins have tandem repeating sequences. Structural understanding of the terminal repeats is well-established for many repeat proteins with the common α-helix and β-sheet folds. By contrast, the sequence–structure interplay of the terminal repeats of the collagen triple-helix remains to be fully explored. As the most abundant human repeat protein and the most prevalent structural component of the extracellular matrix, collagen features a hallmark triple-helix formed by three supercoiled polypeptide chains with long repeating sequences of Gly–X–Y triplets. Here, with CD characterization of 28 collagen-mimetic peptides (CMPs) featuring various terminal motifs, as well as DSC measurements, crystal structure analysis, and computational simulations, we show that CMPs differing only in their terminal repeat may have distinct end structures and stabilities. We reveal that the cross-chain hydrogen bonding mediated by the terminal repeat is key to maintaining the triple-helix's end structure, and that disrupting it with a single amide-to-carboxylate substitution can lower the denaturation temperature by as much as 19 °C. We further demonstrate that the terminal repeat also affects how strongly the CMP strands form hybrid triple-helices with unfolded natural collagen chains in tissue. Our findings provide a spatial profile of hydrogen bonding within the CMP triple-helix, offering a critical guideline for future crystallographic or NMR studies of collagen, for algorithms that predict triple-helix stability, and for peptide-based collagen assemblies and materials. This study will also inspire new understanding of the sequence–structure relationship of many other complex structural proteins with repeating sequences.


Introduction
From single amino acids to domains of over 100 residues, tandem repeating sequences are present in almost 30% of human proteins. 1 Many repeat proteins play essential roles in both basic molecular recognition and pathological aggregation. 2,3 From the ankyrin repeats and leucine zippers to the β-propellers, elucidation of the sequence-structure relationship of these modular folds is enabled by designed oligomers of individual repeats. [4][5][6][7] The external repeats at the N- and C-ends of these proteins, often called the terminal capping repeats, can have a general fold similar to that of the internal repeats, and are often carefully studied and engineered for the proteins' overall solubility and stability. 8 Furthermore, for individual repeats or modules, such as the common α-helix and β-sheet folds, there is well-established structural understanding of their terminal residues. [9][10][11][12][13] Studies of these local capping motifs have promoted understanding of the terminal and boundary structures of the repeat proteins, and inspired novel designs of engineered nanostructures and self-assembling biomolecules. [14][15][16] By contrast, there has been limited exploration of terminal capping for repeat proteins not constructed with α-helices or β-sheets, such as the collagen triple-helix.
The sequence and folding of collagen are defined by repetition. Collagen is the most abundant mammalian protein, and its fundamental structure, the triple-helix, is formed by three interwinding polypeptide chains, each consisting of a long repetitive sequence of Gly-X-Y triplets, where X and Y are often proline (Pro, P) and hydroxyproline (Hyp, O), respectively. 17 Interchain hydrogen-bonding (H-bonding) between the amide of Gly and the carbonyl of Pro stabilizes the triple-helix (Fig. 1a). 18 For decades, collagen mimetic peptides (CMPs), a series of short peptides with 6-10 repeating triplets (often composed of G, P, and O), have been employed as models for understanding the structures and functions of the massive, insoluble natural collagens. [18][19][20][21] Despite collagen's unique structure and important functions in almost every human tissue type, 17 and unlike the well-studied coiled-coil, 22 the sequence-structure relationship for the terminal repeats of a collagen triple-helix remains unknown. The repeating triplet of a canonical CMP sequence can take three forms: POG, GPO, and OGP (Fig. 1b). Of these, only (POG)n and (GPO)n are traditionally used in collagen research. 18,21,23,24 Interestingly, in the UniProt database, the recognized triple-helix regions of most types of human collagen chains are both initiated and terminated as GXY, rather than XYG (Table S1 †). Nonetheless, the two CMP formulae are assumed to be interchangeable, meaning that CMP triple-helices with equal numbers of POG- and GPO-triplets are considered identical in structural stability. Inconsistencies in the reported thermal denaturation temperatures of CMPs [e.g., (POG)8 24,27 are often attributed to terminal functional groups and charges, 27 peptide concentrations, as well as methods and errors from different measurements, including heating rates. 18,28 So, are these collagen repeats indeed structurally equivalent, or can they form terminal cappings with different characteristics?
Here we investigate whether and how CMP triple-helices with different terminal repeats differ in structure and stability. With CD characterization of 28 CMPs with variable terminal motifs (Table S2 †), as well as crystal structure analysis and computational simulations, we reveal that the interchain H-bonding mediated by the terminal repeat is key to the structural disorder of the helices' ends, and that disrupting it by a single change in the functional group can cause destabilization as drastic as 11-19°C in denaturation temperature. Our results indicate a fresh spatial profile of H-bonding within the collagen triple-helix, which will not only contribute to future designs of collagen model peptides, assemblies, and materials, 21,29-31 but also inspire new understanding of the sequence-structure relationship of many other complex repeat proteins. 1

GPO vs. POG
In this study, we rst validated that the peptide length and instrument heating rate can affect the CMPs' thermal stability ( Fig. S1 and S2 †). We monitored the stability by circular dichroism (CD), where a CMP triple-helix dissociated into single chains under gradual heating, and the steepest point of this two-state transition curve is dened as the melting temperature (T m , see Methods). To avoid measurement bias or errors, we carefully prepared CMP 1 [Ac-(POG) 7 -NH 2 ] and 2 [Ac-(GPO) 7 -NH 2 ] and examined their triple-helix stability under the same condition. Despite their identical chain length and amino acid composition, the CD melting curves showed that the T m value of CMP 2 is 10°C higher than that of CMP 1 (Fig. 1c). More strikingly, the T m value of every Ac-(POG) n -NH 2 sequence (n ¼ 5-9) is at least 7°C lower than its GPO counterpart in the series (Fig. S2 †).

The terminal Gly
The sequences of CMP 1 and 2 differ only at the two ends: CMP 1 has an extra C-terminal Gly while CMP 2 has an extra N-terminal one (Fig. 2a). To clarify the effect of each terminal Gly on the triple-helix stability, we made CMP 3, featuring Gly at both termini (Fig. 2a). The Tm value of CMP 3 (47°C) was only 2°C higher than that of CMP 1 (Fig. 2b), suggesting that the extra N-terminal Gly makes almost no contribution to stability. To test whether the extra Gly adds H-bonds, we designed two CMPs that are deficient in H-bond donation at the N-termini: CMP 3SN features an N-acetylated sarcosine (Sar) residue which lacks the amide hydrogen, and CMP 3A has a terminal amine which creates interchain charge repulsion at physiological pH (Fig. 2b). The Tm values of CMP 3SN and 3A were 41-42°C, not far from those of CMP 1 and 3 (Fig. 2b). Considering that CMP 3SN and 3A also involve other destabilizing factors at the N-terminus (steric and charge repulsions), these results suggested that the N-terminal Gly contributes very weakly to interchain H-bonding and the triple-helix stability.
At the C-terminus, even with one more residue in its sequence, CMP 3 had a Tm value 8°C lower than that of CMP 2 (Fig. 2a and c), indicating that the additional C-terminal Gly strongly destabilizes the triple-helix in CMP 1 and 3. Next, we made CMP 3SC, 2O, and 2E, all with little or no capability to form the C-terminal-most interchain H-bond: CMP 3SC features N-methylated Sar and CMP 2E is capped with a hydrogen-deficient ester, while CMP 2O ends with a carboxyl group that is negatively charged at physiological pH (Fig. 2c). The Tm values of CMP 3SC, 2O, and 2E were all drastically lower than that of CMP 2 (ΔTm: 11-19°C). Remarkably, with the substitution of just one functional group at the C-end (i.e., CONH2 → COOCH3), the triple-helix stability decreased by 11°C (CMP 2 vs. 2E). These results suggested that the C-terminal Hyp-amide very likely contributes to new H-bonding that stabilizes CMP 2. Furthermore, our data indicated that completely abolishing the C-terminal H-bonding and inducing local sterics with Sar destabilizes the triple-helix by 13°C (Fig. 2c, CMP 2 vs. 3SC), while attaching a C-terminal Gly destabilizes the helix by 8°C (CMP 2 vs. 3). These results suggested that the C-terminal Hyp-HN-Gly in CMP 3 and 1 probably forms only a particularly weak interchain H-bond.

Crystal structures
Next, we surveyed existing crystal structures of CMP triple-helices in the Protein Data Bank (PDB, see Table S3 †) to search for evidence of structural differences between CMPs with POG- and GPO-terminal repeats. 23,24,[32][33][34][35][36][37][38][39][40][41] We analyzed the B-factor of each CMP structure, as it often correlates with flexibility and internal motion in protein crystallography. 42 We plotted normalized B-factors of all non-hydrogen atoms along each CMP triple-helix: while all structures have elevated structural flexibility at the termini, a general trend of higher terminal B-factors was noted for the POG-sequences (Fig. 3a, S3 and S4 †). For every crystal structure, we calculated the deviation of the B-factor of the N- and C-ending amino acid triplets from the mean B-factor of all atoms in the given triple-helix (C-terminal: Fig. 3b, N-terminal: Fig. S5, † see Methods). The deviation values showed that the POG-CMPs have higher flexibility than the GPO ones at the C-termini. We also noted that the crystal structures of the POG-sequences are more likely to have unresolved or missing terminal residues than the GPO ones (Fig. 3b, asterisks, Table S3 †), further implying that the POG-ended C-termini may be more disordered. Finally, we noted that the distances and angles between the C-terminal Hyp-NH2 and Pro-C=O are suitable for creating interchain H-bonds in multiple GPO crystal structures ending with Hyp-amide (Fig. 3c and Table S3 †).
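The terminal B-factor comparison described above can be sketched as follows. The exact normalization is given in the Methods and is not reproduced here; this example assumes, for illustration only, that the deviation is expressed relative to the mean B-factor of all atoms in the structure. The function name `terminal_bfactor_deviation` and the toy numbers are hypothetical.

```python
import numpy as np

def terminal_bfactor_deviation(b_factors, residue_ids, terminal_residues):
    """Relative deviation of a terminal triplet's mean B-factor from the
    mean B-factor of all atoms (assumed normalization, not the paper's)."""
    b = np.asarray(b_factors, dtype=float)
    res = np.asarray(residue_ids)
    mean_all = b.mean()
    # Select the atoms that belong to the terminal triplet residues.
    mask = np.isin(res, list(terminal_residues))
    return float((b[mask].mean() - mean_all) / mean_all)

# Toy example: six atoms in three residues; residue 3 is the "terminus".
b_values = [10.0, 10.0, 10.0, 10.0, 20.0, 20.0]
residues = [1, 1, 2, 2, 3, 3]
deviation = terminal_bfactor_deviation(b_values, residues, {3})
print(f"Terminal B-factor deviation: {deviation:+.2f}")
```

A positive value indicates a terminus that is more mobile than the structure on average, which is the trend reported for the POG-ended C-termini.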

Molecular dynamics (MD) simulations
To further understand the differences among CMPs in terminal flexibility and thermal stability, we used fully atomistic MD simulations to build CMP 1, 2, 3, and 2E and fully relaxed them (see details in Methods). 43,44 We computed the root-mean-square deviation (RMSD) and radius of gyration (Rg) of amino acid triplets at representative locations, namely the acetylated N-terminus, the triplet in the center, and the C-terminus of interest (Fig. 4a, S6 and S7 †). The RMSD value measures the mean deviation of each atom within the region from its initial conformation, and it is used to quantify random migration caused by thermal fluctuation. Rg measures the mean size of the atoms within the region. The similar RMSD and Rg values for the four CMPs at the N-terminus and center suggest that they have very similar dynamics and size during simulation (Fig. 4a); this is expected as the four CMPs share the same or similar chemical structures at these two locations. However, at the C-terminus, the RMSD of CMP 2 is significantly lower and its Rg is significantly smaller than those of the other three CMPs, suggesting that the C-terminus of CMP 2 (i.e., Hyp-CONH2) moves less during thermal fluctuation and keeps a more compact size (Fig. 4a). This result correlates nicely with our observation of the relatively lower B-factors for the CMPs with the C-terminal GPO repeat (Fig. 4a).
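For readers less familiar with these two metrics, a minimal sketch of how RMSD and Rg can be computed for a selected region is given below. It follows the textbook definitions only; it omits any structural superposition or trajectory handling that an MD analysis package would perform, and the function names and toy coordinates are illustrative assumptions rather than the analysis code used in this work.

```python
import numpy as np

def rmsd(coords, reference):
    """Root-mean-square deviation of atoms from a reference conformation."""
    diff = np.asarray(coords, dtype=float) - np.asarray(reference, dtype=float)
    # Mean of squared per-atom displacements, then the square root.
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

def radius_of_gyration(coords, masses=None):
    """Mass-weighted root-mean-square distance of atoms from their center."""
    xyz = np.asarray(coords, dtype=float)
    m = np.ones(len(xyz)) if masses is None else np.asarray(masses, dtype=float)
    center = (xyz * m[:, None]).sum(axis=0) / m.sum()
    sq_dist = ((xyz - center) ** 2).sum(axis=1)
    return float(np.sqrt((m * sq_dist).sum() / m.sum()))

# Toy region of two atoms, displaced rigidly by 1 Å along y.
initial = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
current = initial + np.array([0.0, 1.0, 0.0])
print(rmsd(current, initial), radius_of_gyration(current))
```

A lower RMSD and smaller Rg for a terminal triplet, as reported for CMP 2, would correspond to a region that moves less and stays more compact.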
We also compared the distribution of the H-bonds, expressed as the time-averaged number of H-bonds between any pair of residues within these CMPs (Fig. 4b). CMP 2 has H-bonds homogeneously distributed along each of the three chains, with strong H-bonds near the C-termini (yellow spots), while the other three sequences have missing H-bonds at their C-termini during the relaxation (arrows). For example, CMP 3 lacks the interchain H-bonding between chains 2 and 3 (at residues 44 and 66), CMP 1 lacks H-bonding between chains 2 and 3 (at residues 42 and 63), and CMP 2E lacks H-bonding between chains 1 and 2 (at residues 21 and 42). The pattern of the missing H-bonds corresponds to the partially loose structure at the C-termini of these three CMP molecules, as shown by the relaxed molecular structures: two of the three CMP chains are tightly bonded while the third one is not (Fig. 4a, dotted circles). Together, our simulations supported that, except for CMP 2, these triple-helices (with either Hyp-Gly-CONH2 or Hyp-COOCH3 as the end moiety) have weakened H-bonds and loose structures at the C-termini.
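The time-averaged H-bond count can be illustrated with a simple geometric criterion. The cutoffs below (donor-acceptor distance of at most 3.5 Å and a donor-hydrogen-acceptor angle of at least 120°) are commonly used defaults and are assumptions here, not the criteria stated in the Methods; `is_hbond` and `hbond_occupancy` are hypothetical helper names.

```python
import numpy as np

def is_hbond(donor, hydrogen, acceptor, max_dist=3.5, min_angle_deg=120.0):
    """Geometric H-bond test with assumed distance and angle cutoffs."""
    d, h, a = (np.asarray(v, dtype=float) for v in (donor, hydrogen, acceptor))
    if np.linalg.norm(a - d) > max_dist:
        return False
    # Angle at the hydrogen between the donor and the acceptor.
    v1, v2 = d - h, a - h
    cos_angle = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    angle = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return bool(angle >= min_angle_deg)

def hbond_occupancy(frames):
    """Fraction of frames in which the donor/acceptor pair is H-bonded."""
    flags = [is_hbond(*frame) for frame in frames]
    return sum(flags) / len(flags)

# Two toy frames: one with a linear, short H-bond and one with the pair apart.
bonded = ([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.9, 0.0, 0.0])
broken = ([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [6.0, 0.0, 0.0])
occupancy = hbond_occupancy([bonded, broken])
print(f"H-bond occupancy: {occupancy:.2f}")
```

Summing such occupancies over all donor/acceptor atoms of two residues would give a per-residue-pair value of the kind mapped in Fig. 4b.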

Differential scanning calorimetry (DSC)
To directly interrogate whether CMP 2 has greater interchain H-bonding, we obtained the thermal denaturation curves of CMP 1, 2, 3, 3SC, and 2E using DSC, and measured the enthalpy change (ΔH) for each peptide (Fig. 4c and S8, † see Methods). 45,46 CMP 2 showed the highest ΔH value, which was 6.4 kcal mol−1 higher than that of CMP 3, and 7.2 kcal mol−1 higher than that of CMP 1. Also, the ΔH value of CMP 3 was close to that of CMP 3SC, which lacks the C-terminal H-bonding due to N-methylation. All of these data are in line with our CD Tm measurements and support that the C-terminal Hyp-CONH2 of CMP 2 is engaged in interchain H-bonds, which are weakened by the appended Gly in CMP 3. Meanwhile, the ΔH value of CMP 1 was almost the same as that of CMP 3, also supporting that the extra N-terminal Gly in CMP 3 barely contributes to stability.

Terminal Pro and Hyp residues
Using the approach described in Fig. 2, we studied the structural effects of Pro and Hyp on each end (Fig. 5). For Pro, the Tm comparisons indicated that an extra Pro at either the N- or C-terminus can stabilize the triple-helix by 7-8°C (Fig. 5a). For Hyp, while adding Hyp to the N-terminus contributed little to stability (ΔTm = 1°C), incorporating a C-terminal Hyp can raise the Tm by 12°C (Fig. 5b). After studying the effect of the terminal residue on the thermal stability of CMPs of the same length, we measured the change in CMP stability during incremental sequence extension from (GPO)7 to (GPO)8 in both the N- and C-directions (Fig. 5c). By sequentially adding O, P, and G residues from the N-terminus, we found that the greatest Tm increase occurred with Pro (Fig. 5c, left). At the C-terminus, adding Pro compensated for the Tm fall caused by Gly, while the biggest jump in Tm came with Hyp (Fig. 5c, right). We conducted additional measurements for (POG)7 → (POG)8 and (OGP)7 → (OGP)8 and obtained data in line with Fig. 5c (Fig. S9 and S10 †).

A hydrogen-bonding map
Based on the simulation, DSC, and all Tm data (Fig. 2-5 and Table S2 †), a schematic map of possible interchain Pro-C=O···HN-Gly H-bond patterns can be sketched for the three CMP models with different repeating units (Fig. 6a). For these N-acetylated peptides, the main difference lies in the C-terminal regions. For Ac-(OGP)7-NH2 (CMP 4), the last Pro···Gly H-bonds cannot form due to the lack of a Gly H-bond donor; for Ac-(POG)7-NH2 (CMP 1), although the C-terminal Pro-C=O could bond with the ending HN-Gly, the flexible Gly apparently interferes with this interaction (Fig. 2-4). In contrast, for Ac-(GPO)7-NH2 (CMP 2), "extra" C-terminal-most H-bonds can possibly form between Pro's carbonyl and Hyp's ending NH2 group (Fig. 3c and 4c), resulting in the peptide's higher triple-helix stability. This H-bonding map can help explain the inconsistent effects of the terminal charges on the three CMP sequences. For example, substituting a neutral C-terminal amide with a negatively charged carboxyl group in (GPO)7 may abolish the extra C-terminal H-bonding in CMP 2 (green block, Fig. 6a), thus dramatically lowering the Tm value by 19°C, far exceeding the ΔTm values of the other two counterparts (Fig. 6b). At the unacetylated N-terminus, positive charge repulsion can be expected to destabilize the POG sequence the most (Fig. 6c), since the end charge repulsion can directly weaken the interchain H-bonding only when Pro is the N-terminal-most residue (Fig. 6a, note the locations of the three N-terminal green blocks).

CMP-collagen hybridization
We previously reported that CMP single-strands can bind to and form hybridized triple-helices with unfolded natural collagen chains in pathological tissues and denatured collagen materials (i.e., gelatin). [47][48][49][50] The collagen hybridization is strongly driven by the triple-helix folding propensity of the CMPs. To test whether CMPs differing only in their terminal repeat bind to denatured collagen with the same affinity, we prepared carboxyfluorescein-labeled CMP 1, 2, and 4 (designated as 1F, 2F, and 4F) and compared their binding to unfolded collagen on gelatin-coated assay plates (Fig. 6d) and paraffin-embedded sections of rat hearts (Fig. 6e). To enable CMP-collagen hybridization, the F-CMPs were dissociated into single-strands by heating at 85°C before binding, and the heart sections had undergone heat-mediated antigen retrieval to completely denature their collagen content (see Methods). 48 We found that, on both the gelatin coating and the heart sections, CMP 2F showed the highest affinity for denatured collagen, followed by CMP 4F and 1F (Fig. 6d and e). In the heart sections co-stained with CMP 2F and an anti-collagen I antibody, the positive CMP and antibody signals strongly overlapped (Fig. S11 †), validating the peptide's high specificity for collagen. These results demonstrated that the GPO-featuring CMP 2F has the strongest triple-helical folding propensity among the three forms during CMP-collagen hybridization.

Discussion
Our main nding is that the interchain H-bonding determines the structure of CMP's different terminal repeats. Previous crystallographic studies of CMP triple-helices revealed that the terminal amino acids oen lack interchain H-bonding and splay away from the core helical axis, giving them higher mobilities and B-factors. 33,[51][52][53] Comparative NMR analysis of Ac-(POG) 10 -NH 2 also revealed stretches of disorder as wide as six amino acids at the C-terminus. 51 Interestingly, these reports were predominantly based on the POG-repeating sequences. In addition to supporting these prior ndings (Table S3 †), our study discovered that the GPO-repeating sequences can form an extra set of stabilizing inter-helix H-bonds at the C-terminal. As evidence, the T m gap between Ac-(POG) 7 -NH 2 and Ac-(GPO) 7 -NH 2 is 10°C (Fig. 1), which is essentially equal to the T m increase gained by adding a triplet unit [Fig. S2, † Ac-(POG) 7 -NH 2 / Ac-(POG) 8 -NH 2 : 45 / 56°C; Ac-(GPO) 7 -NH 2 / Ac-(GPO) 8 -NH 2 : 55 / 64°C]. Second, it was reported that substituting one Gly to aza-Gly, a synthetic residue that can form one additional cross-chain H-bond, also increases the T m value of (POG) 7 by 11°C. 54 Third, the energy of an inter-helix Pro/Gly H-bond was estimated as 2.0 kcal mol −1 , 55 while unfolding DH value of Ac-(GPO) 7 -NH 2 was 7 kcal mol −1 greater than Ac-(POG) 7 -NH 2 (Fig. 4c), comparable to three H-bonds. These reported results are well in line with our data, supporting the creation of extra H-bonds by the C-terminal Hyp-CONH 2 .
All our data suggest the exible Gly as the cause of POG's inability to form stable H-bonds at the C-terminus (Fig. 2-6). Because Pro and Hyp both lack the N-hydrogen atom, Gly is the sole interchain H-bond donor in the whole triple-helix (Fig. 1a). 18 Unlike salt bridges that can spontaneously form by electrostatic attraction from any direction, H-bond formation requires the participating functional groups to be within proper distances and angles. In the central triplets, the Hyp and Pro anking a Gly residue ensure the peptide's polyproline-II-helix conformation, thereby offering the proper angle for Gly to form the interchain H-bond. However, it can be envisioned that at the C-terminus, with reduced conformational restrictions, Gly exhibits a high degree of disorder and lacks a dened backbone structure (see MD simulation in Fig. S12 †), which can lead to H-bond disruption. This may also explain why adding Pro to the C-terminal Gly recovers the T m value by 7-8°C (Fig. 5, S9 and S10 †). Based on 1 H-15 N NMR experiments, a concurrent study on the similar topic also suggested the Gly exibility at the N-and C-termini. 56 Meanwhile, although the interchain H-bond is formed by the carbonyl of Pro and the amine of Gly (Fig. 1a, red and blue boxes), these two functional groups are covalently connected by the Hyp in-between (i.e., /O]C-Hyp-NH/) at the Y position. Thereby the conformation of Hyp, which induces specic backbone folding, can directly affect the bond angles of all interchain H-bonds within the collagen triple-helix. This provides a structural insight for post-translational hydroxylation of Pro that almost exclusively occurs at the Y position of natural collagen chains. 17 Given the several variables we examined in this work, including the sequence and length (Fig. S2 †), the terminal residue ( Fig. 2 and 5) and the charge (Fig. 6b and c), as well as the CD heating rate (Fig. 
S1 †), it is possible to explain the various T m values of similar CMPs from our study and earlier publications (See Table S4 † for example). 57 More importantly, based on our ndings, conicting results from previous reports can now be reconciled with the terminal repeat argument [e.g., (POG) 7 : 43°C vs. (GPO) 7 : 55°C]. 24,27 The N-termini of most existing POG-based crystal structures are disordered (Table  S3 †), probably because those N-terminal Pro residues are not acetylated, resulting in charge repulsion disrupting the Hbonding (Fig. 6a). Meanwhile, it was recently reported that the positive charges of ammonium groups destabilize the triplehelix [e.g., H-(POG) 7 -NH 2 , pH 7.4 vs. 10.6, DT m ¼ 6°C] to a greater extent than the negative charges of carboxylate groups [e.g., Ac-(POG) 7 -NH 2 vs. Ac-(POG) 7 -OH, DT m ¼ 3°C at pH 7.4]. 27 This discrepancy is probably because the charge repulsion at the N-terminal Pro weakens the H-bonding more than the fraying Gly at the C-terminus. Our ndings also suggest that CMPs featuring Ac-(G)POG/GPO-NH 2 as the ending motifs are more likely to have reduced terminal exibility and may be more suitable for future crystallographic or NMR studies.
Conventionally, Gly was preferred as the C-terminal residue in many collagen peptide studies, probably because of the affordability of Gly-preloaded resins and the reduced risk of epimerization owing to its lack of chirality. For decades, (POG)n and (GPO)n have been considered interchangeable. 28,35,58,59 Our study disproves this assumption and points out the need to note the terminal repeats when comparing CMPs from different works. The role of the common terminal functional groups in the triple-helix stability of CMPs was recently highlighted 27 and incorporated into an algorithm for predicting the stability of collagen triple-helices. 21 Our findings show that the reported effects of the terminal functional groups apply only to the POG end motif, 27 but not to the terminal OGP- and GPO-repeats (Fig. 6b and c). Our study emphasizes the need, and provides the reference, to account for the difference in terminal repeats in such algorithms to avoid unexpected biases. 21 Our work showed that the terminal repeats affect not only the assembly of CMP homo-trimers but also how strongly the peptide strands form hybrid triple-helices with natural collagen chains (Fig. 6d and e). For applications, this study will provide helpful guidance in designing potent collagen-targeting probes 47,48 and fabricating synthetic collagen materials. 29,30 Meanwhile, similar investigations of terminal repeats have been rare for fibrous structural proteins, such as keratin, silk fibroin, elastin, fibrin, and myosin, many of which are insoluble and lack crystallography-based structural elucidation. Our findings and methods may inspire new investigations into the folding of these repeat proteins, particularly for the sequence-structure relationship at their termini.

Data availability
The data that support the ndings of this study are available within the article and its ESI, † or from the corresponding author on reasonable request.