Mechanical codes of chemical-scale specificity in DNA motifs

In gene transcription, certain sequences of double-stranded (ds)DNA play a vital role in nucleosome positioning and expression initiation. That dsDNA is deformed to various extents in these processes leads us to ask: Could genomic DNA also have sequence specificity in its chemical-scale mechanical properties? We approach this question using statistical machine learning to determine the rigidity between DNA chemical moieties. What emerges for the polyA, polyG, TpA, and CpG sequences studied here is a unique trigram that contains the quantitative mechanical strengths between bases and along the backbone. In a way, such a sequence-dependent trigram could be viewed as a DNA mechanical code. Interestingly, we discover a compensatory competition between the axial base-stacking interaction and the transverse base-pairing interaction, and this reciprocal relationship constitutes the most discriminating feature of the mechanical code. Our results also provide a chemical-scale understanding of experimental observables. For example, polyA, known for its long persistence length, is shown to have strong base stacking, while its complement (polyAc) exhibits high backbone rigidity. The mechanical code concept enables a direct reading of the physical interactions encoded in the sequence which, with further development, is expected to shed new light on DNA allostery and DNA-binding drugs.

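Next, the local tangent vector to the rod at the $i$-th base pair, $\mathbf{t}_i$, is approximated from the positions $\mathbf{r}_i$, and the shape of the rod at the $i$-th base pair, $\theta(s(\mathbf{r}_i))$, is calculated as

$$\theta(s(\mathbf{r}_i)) = \arccos(\mathbf{t}_i \cdot \mathbf{t}_4), \qquad i = 4, \dots, 18.$$

Furthermore, the shape $\theta(s)$ can be expressed as a superposition of Fourier modes,

$$\theta(s) = \sqrt{\frac{2}{L}} \sum_{n=0}^{\infty} a_n \cos\left(\frac{n\pi s}{L}\right),$$

where $L$ is the length of the rod and the $n$-th mode amplitude is

$$a_n = \sqrt{\frac{2}{L}} \int_0^L \theta(s) \cos\left(\frac{n\pi s}{L}\right) ds.$$

Under the elastic-rod assumption, the bending energy $U$ is a quadratic sum of the mode amplitudes,

$$U = \frac{EI}{2} \int_0^L \left(\frac{d\theta}{ds} - \frac{d\theta^0}{ds}\right)^2 ds = \frac{EI}{2} \sum_{n=1}^{\infty} \left(\frac{n\pi}{L}\right)^2 \left(a_n - a_n^0\right)^2,$$

where $\theta^0(s)$ is the shape in the absence of external forces, $a_n^0$ is the amplitude of the $n$-th mode in the absence of an external force, and $EI$ is the flexural rigidity. According to the equipartition principle, each mode carries $\frac{1}{2} k_B T$ of thermal energy, so that

$$\mathrm{Var}(a_n) = \frac{k_B T}{EI} \left(\frac{L}{n\pi}\right)^2 = \frac{1}{L_p} \left(\frac{L}{n\pi}\right)^2,$$

where $\mathrm{Var}(a_n)$ is the variance of the $n$-th mode amplitude and $L_p$ is the persistence length.
Therefore, calculating $\mathrm{Var}(a_n)$ from the MD trajectory, together with knowledge of the contour length, provides an estimate of $L_p$ (ref. S2). Here, the mode index $n$ is set to 2 because the wavelength of the $n = 2$ mode is consistent with the length scale of bending deformation at the level of the whole molecule.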
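The estimate of the persistence length from the variance of a single Fourier mode can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the array layout (one row of tangent angles per trajectory frame), and the use of a single fixed arc-length grid for all frames are assumptions made for the example.

```python
import numpy as np

def mode_amplitude(theta, s, n):
    """a_n = sqrt(2/L) * integral_0^L theta(s) cos(n*pi*s/L) ds (trapezoid rule)."""
    L = s[-1] - s[0]
    integrand = theta * np.cos(n * np.pi * (s - s[0]) / L)
    # Explicit trapezoid rule over the (possibly non-uniform) arc-length grid.
    integral = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(s))
    return np.sqrt(2.0 / L) * integral

def persistence_length(thetas, s, n=2):
    """Estimate L_p from Var(a_n) = (1/L_p) * (L/(n*pi))**2.

    thetas : (n_frames, n_points) tangent angles theta(s) for each frame
    s      : (n_points,) arc-length positions (assumed identical for all frames)
    """
    L = s[-1] - s[0]
    amplitudes = np.array([mode_amplitude(theta, s, n) for theta in thetas])
    return (L / (n * np.pi)) ** 2 / np.var(amplitudes)
```

In practice the contour length fluctuates from frame to frame, so a real analysis would need to decide how $L$ and $s$ are defined per frame; the sketch sidesteps that choice by assuming a common grid.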

Categorization of rigidity graphs

Base-pairing hydrogen bond (hb). The rigidity graph $K^{hb}_n$ characterizes the strengths of canonical Watson-Crick base pairing for the trajectory window $n$. The vertices and edges of each A-T base pair are selected based on the heavy atoms of the two hydrogen bonds, in addition to the C2-O2 atom-type pair at the minor-groove side, as shown in Figure 2. For each G-C base pair, the vertices and edges are selected based on the heavy atoms of the three hydrogen bonds. The edge weights of $K^{hb}_n$ are the corresponding spring constants $k^{hb}_{ij}$ of the window $n$. The selection of the list of most strongly coupled atom-type pairs $\{a\text{-}b\}^{hb}$ is trivial, i.e., {C2-O2, N1-N3, N6-O4} for the A-T base pair and {N2-O2, N1-N3, O6-N4} for the G-C base pair.

Base-stacking (st). The rigidity graph $K^{st}_n$ is designed to capture the strengths of the stacking interaction for the trajectory window $n$. The vertices comprise all nucleobase atoms in the two strands of dsDNA, and the edges are the haENM springs defined between adjacent nucleobases. The edge weights of $K^{st}_n$ are the corresponding spring constants $k^{st}_{ij}$ of the window $n$.

Base-ribose (BR). The rigidity graph $K^{BR}_n$ is designed to capture the coupling strengths between the nucleobase and the ribose in a nucleotide for the trajectory window $n$. The vertices comprise the six atoms of ribose (C1', C2', C3', C4', O4', O3') and all heavy atoms of the nucleobase for all nucleotides in the two strands of dsDNA, and the edges are the haENM springs defined between the nucleobase and the ribose of each nucleotide. Atom pairs coinciding with covalent bonds and bond angles are excluded. The edge weights of $K^{BR}_n$ are the corresponding spring constants $k^{BR}_{ij}$ of the window $n$.

Ribose-phosphate (RP). The rigidity graph $K^{RP}_n$ is designed to quantify the rigidity of the ribose-phosphate backbone in the two strands of dsDNA for the trajectory window $n$. The vertices comprise the atoms of all riboses and phosphate groups in dsDNA, and the edges are the springs connecting a ribose and a phosphate group. Atom pairs coinciding with covalent bonds and bond angles are excluded. The edge weights of $K^{RP}_n$ are the corresponding spring constants $k^{RP}_{ij}$ of the window $n$.

The correlation between base-pairing rigidity and base-stacking rigidity in the transcription regulatory sequences

To understand how each coupling strength of $\{a\text{-}b\}^{hb}$ correlates with those of $\{a\text{-}b\}^{st}$, a detailed analysis is performed for the transcription regulatory sequences (Figures S18-S21). At the level of inter-atom couplings for each base $p$, the strength of each atom pair in the $\{a\text{-}b\}^{st}$ list of stacking hotspots is plotted against the strength of each base-pairing hydrogen bond.

Despite the noisier data, each inter-atom stacking coupling is negatively correlated with all three base-pairing interactions in polyA (Figure S18a). For polyG, a negative correlation between stacking and base-pairing strength is observed at the major-groove side, for the weakest hydrogen bond (Figure S19a). At the base middle, where the hydrogen bond is about twice as strong, the stacking strength is almost independent of the base-pairing strength. At the minor-groove side, where polyG has its strongest hydrogen bond, the stacking is more than five times weaker and becomes positively correlated with the hydrogen-bonding strength. Consistent with the behavior observed in polyA, the base-stacking rigidities of both TpA(AT) and TpA(TA) are negatively correlated with the base-pairing rigidity (Figure S20a). Within the CpG strands, the prominent stacking strengths are positively correlated with the hydrogen-bonding spring constant at the minor-groove side, as in polyG, while negative correlations with those at the base middle and the major-groove side are also observed (Figure S21a). These results corroborate the tight connection between stacking and base-pairing interactions, and show that the ordering of hydrogen-bonding strength across the grooves has a significant influence on base stacking.
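A rigidity graph of the kind described above can be represented as a symmetric weighted adjacency matrix whose entries are spring constants. The sketch below is only an illustration under stated assumptions: the spring-list layout `(i, j, k, category)`, the function names, and the choice to decompose the adjacency matrix directly (rather than, say, a graph Laplacian) are hypothetical and are not taken from the source.

```python
import numpy as np

def rigidity_graph(n_atoms, springs, category):
    """Build the weighted adjacency matrix for one category of springs.

    springs : iterable of (i, j, k, cat) with atom indices i, j,
              spring constant k, and a category label such as
              "hb", "st", "BR" or "RP" (hypothetical layout).
    """
    K = np.zeros((n_atoms, n_atoms))
    for i, j, k, cat in springs:
        if cat == category:
            K[i, j] = K[j, i] = k  # undirected edge weighted by the spring constant
    return K

def graph_modes(K):
    """Eigenvalues (descending) and matching eigenvectors of a symmetric K."""
    vals, vecs = np.linalg.eigh(K)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]
```

One matrix per trajectory window $n$ and per category would then be built from the springs of that window.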

Movie S1
Axial-transverse competition in polyA and polyG (top view). The persistent hydrogen bonds in polyG come with movements parallel to the base-pairing plane, and such transverse displacements would disrupt the base stacking. With the very weak A-T pairing at the minor-groove side, alternative base movements toward the major-groove side that keep the stacking more intact are observed. In the larger-base strand of polyA, movements along the adenine plane are clearly more restrained than those in the guanine strand of polyG, and the slide and shift of polyG indeed have wider distributions than those of polyA (Figure S16).

Movie S2
Axial-transverse competition in polyA and polyG (side view).

Movie S3
Structural dynamics of backbone and base stacking in polyA and polyG. The backbone of polyAc exhibits the slightest fluctuation and stays in the BI state, which is consistent with the exceptionally high $k^{RP}_p$ values in polyAc. On the other hand, the backbones of polyA, polyG and polyGc exhibit large fluctuations and have frequent transitions between the BI state and the BII state. In the cases of polyG and polyGc, strong base pairing makes the bases move more frequently, thereby leading to the large fluctuations in their backbones.

Movie S4
Structural dynamics of backbone and base stacking in TpA and CpG. The backbone dynamics of TpA(AT) are similar to those of polyAc, which tends to stay in the BI state. The TpA(TA) backbone, on the other hand, is similar to polyA, with a noticeable BII population. In CpG, the backbone of CpG(GC) stays in the BII state for a longer time, whereas the backbone of CpG(CG) prefers to stay in the BI state.

Figure S1: Identification of prominent modes in $K^{st}$ of polyA and $K^{st}$ of polyAc. (a) The distributions of the averaged mean-mode content $r^{st}_\alpha$ and eigenvalue $\lambda^{st}_\alpha$, where $\alpha$ is the mode index. Statistical outliers of the $\lambda^{st}_\alpha$ distribution ($\lambda^{st}_\alpha > 1.5$ IQR, red dotted line) that also have high mean-mode content ($r^{st}_\alpha > 0.8$, green dotted line) are identified as the prominent modes of $K^{st}$ (orange dots). (b) For each atom in polyA and polyAc, the magnitude of the $\alpha$-th prominent mode, $|\nu^{st}_\alpha|$, quantifies the importance of the atom for $K^{st}$. For illustration, $|\nu^{st}_1|$ of polyA and $|\nu^{st}_1|$ of polyAc are shown. Each bar chart shows $|\nu^{st}_1|$ for a specific atom in base A (polyA) or base T (polyAc) along the nucleotide index. The high-weight atoms of $K^{st}$ are those whose $|\nu^{st}_\alpha|$ is higher than 0.1 (blue dotted lines).

Figure S2: Identification of prominent modes in $K^{st}$ of polyG and $K^{st}$ of polyGc. (a) The distributions of the averaged mean-mode content $r^{st}_\alpha$ and eigenvalue $\lambda^{st}_\alpha$, where $\alpha$ is the mode index. Statistical outliers of the $\lambda^{st}_\alpha$ distribution ($\lambda^{st}_\alpha > 1.5$ IQR, red dotted line) that also have high mean-mode content ($r^{st}_\alpha > 0.8$, green dotted line) are identified as the prominent modes of $K^{st}$ (orange dots). (b) For each atom in polyG and polyGc, the magnitude of the $\alpha$-th prominent mode, $|\nu^{st}_\alpha|$, quantifies the importance of the atom for $K^{st}$. For illustration, $|\nu^{st}_1|$ of polyG and $|\nu^{st}_1|$ of polyGc are shown. Each bar chart shows $|\nu^{st}_1|$ for a specific atom in base G (polyG) or base C (polyGc) along the nucleotide index. The high-weight atoms of $K^{st}$ are those whose $|\nu^{st}_\alpha|$ is higher than 0.1 (blue dotted lines).

Figure S3: Identification of prominent modes in $K^{st}$ of TpA and $K^{st}$ of CpG. (a) The distributions of the averaged mean-mode content $r^{st}_\alpha$ and eigenvalue $\lambda^{st}_\alpha$, where $\alpha$ is the mode index. Statistical outliers of the $\lambda^{st}_\alpha$ distribution ($\lambda^{st}_\alpha > 1.5$ IQR, red dotted line) that also have high mean-mode content ($r^{st}_\alpha > 0.8$, green dotted line) are identified as the prominent modes of $K^{st}$ (orange dots). For each atom in TpA and CpG, the magnitude of the $\alpha$-th prominent mode, $|\nu^{st}_\alpha|$, quantifies the importance of the atom for $K^{st}$. For illustration, (b) $|\nu^{st}_1|$ of TpA and (c) $|\nu^{st}_1|$ of CpG are shown. Each bar chart shows $|\nu^{st}_1|$ for a specific atom in base A (TpA(A)), base T (TpA(T)), base G (CpG(G)) or base C (CpG(C)) along the nucleotide index. The high-weight atoms of $K^{st}$ are those whose $|\nu^{st}_\alpha|$ is higher than 0.1 (blue dotted lines).

Figure S4: Identification of prominent modes in $K^{BR}$ of polyA and $K^{BR}$ of polyAc. (a) The distributions of the averaged mean-mode content $r^{BR}_\alpha$ and eigenvalue $\lambda^{BR}_\alpha$, where $\alpha$ is the mode index. Statistical outliers of the $\lambda^{BR}_\alpha$ distribution ($\lambda^{BR}_\alpha > 1.5$ IQR, red dotted line) that also have high mean-mode content ($r^{BR}_\alpha > 0.8$, green dotted line) are identified as the prominent modes of $K^{BR}$ (orange dots). (b) For each atom in polyA and polyAc, the magnitude of the $\alpha$-th prominent mode, $|\nu^{BR}_\alpha|$, quantifies the importance of the atom for $K^{BR}$. For illustration, $|\nu^{BR}_1|$ of polyA and $|\nu^{BR}_1|$ of polyAc are shown. The bar chart shows $|\nu^{BR}_1|$ for the atoms of the ribose and the atoms of the base (polyA: A, polyAc: T). The high-weight atoms of $K^{BR}$ are those whose $|\nu^{BR}_\alpha|$ is higher than 0.1 (blue dotted lines).

Figure S5: Identification of prominent modes in $K^{BR}$ of polyG and $K^{BR}$ of polyGc. (a) The distributions of the averaged mean-mode content $r^{BR}_\alpha$ and eigenvalue $\lambda^{BR}_\alpha$, where $\alpha$ is the mode index. Statistical outliers of the $\lambda^{BR}_\alpha$ distribution ($\lambda^{BR}_\alpha > 1.5$ IQR, red dotted line) that also have high mean-mode content ($r^{BR}_\alpha > 0.8$, green dotted line) are identified as the prominent modes of $K^{BR}$ (orange dots). (b) For each atom in polyG and polyGc, the magnitude of the $\alpha$-th prominent mode, $|\nu^{BR}_\alpha|$, quantifies the importance of the atom for $K^{BR}$. For illustration, $|\nu^{BR}_1|$ of polyG and $|\nu^{BR}_1|$ of polyGc are shown. The bar chart shows $|\nu^{BR}_1|$ for the atoms of the ribose and the atoms of the base (polyG: G, polyGc: C). The high-weight atoms of $K^{BR}$ are those whose $|\nu^{BR}_\alpha|$ is higher than 0.1 (blue dotted lines).

Figure S6: Identification of prominent modes in $K^{BR}$ of TpA(A) and $K^{BR}$ of TpA(T). (a) The distributions of the averaged mean-mode content $r^{BR}_\alpha$ and eigenvalue $\lambda^{BR}_\alpha$, where $\alpha$ is the mode index. Statistical outliers of the $\lambda^{BR}_\alpha$ distribution ($\lambda^{BR}_\alpha > 1.5$ IQR, red dotted line) that also have high mean-mode content ($r^{BR}_\alpha > 0.8$, green dotted line) are identified as the prominent modes of $K^{BR}$ (orange dots). (b) For each atom in TpA(A) and TpA(T), the magnitude of the $\alpha$-th prominent mode, $|\nu^{BR}_\alpha|$, quantifies the importance of the atom for $K^{BR}$. For illustration, $|\nu^{BR}_1|$ of TpA(A) and $|\nu^{BR}_1|$ of TpA(T) are shown. The bar chart shows $|\nu^{BR}_1|$ for the atoms of the ribose and the atoms of the base (TpA(A): A, TpA(T): T). The high-weight atoms of $K^{BR}$ are those whose $|\nu^{BR}_\alpha|$ is higher than 0.1 (blue dotted lines).

Figure S7: Identification of prominent modes in $K^{BR}$ of CpG(G) and $K^{BR}$ of CpG(C). (a) The distributions of the averaged mean-mode content $r^{BR}_\alpha$ and eigenvalue $\lambda^{BR}_\alpha$, where $\alpha$ is the mode index. Statistical outliers of the $\lambda^{BR}_\alpha$ distribution ($\lambda^{BR}_\alpha > 1.5$ IQR, red dotted line) that also have high mean-mode content ($r^{BR}_\alpha > 0.8$, green dotted line) are identified as the prominent modes of $K^{BR}$ (orange dots). (b) For each atom in CpG(G) and CpG(C), the magnitude of the $\alpha$-th prominent mode, $|\nu^{BR}_\alpha|$, quantifies the importance of the atom for $K^{BR}$. For illustration, $|\nu^{BR}_1|$ of CpG(G) and $|\nu^{BR}_1|$ of CpG(C) are shown. The bar chart shows $|\nu^{BR}_1|$ for the atoms of the ribose and the atoms of the base (CpG(G): G, CpG(C): C). The high-weight atoms of $K^{BR}$ are those whose $|\nu^{BR}_\alpha|$ is higher than 0.1 (blue dotted lines).

Figure S8: Identification of prominent modes in $K^{RP}$ of polyA and $K^{RP}$ of polyAc. (a) The distributions of the averaged mean-mode content $r^{RP}_\alpha$ and eigenvalue $\lambda^{RP}_\alpha$, where $\alpha$ is the mode index. Statistical outliers of the $\lambda^{RP}_\alpha$ distribution ($\lambda^{RP}_\alpha > 1.5$ IQR, red dotted line) that also have high mean-mode content ($r^{RP}_\alpha > 0.8$, green dotted line) are identified as the prominent modes of $K^{RP}$ (orange dots). (b) For each atom in polyA and polyAc, the magnitude of the $\alpha$-th prominent mode, $|\nu^{RP}_\alpha|$, quantifies the importance of the atom for $K^{RP}$. For illustration, $|\nu^{RP}_1|$ of polyA and $|\nu^{RP}_1|$ of polyAc are shown. Each bar chart shows $|\nu^{RP}_1|$ for a specific atom in the ribose-phosphate backbone along the nucleotide index. The high-weight atoms of $K^{RP}$ are those whose $|\nu^{RP}_\alpha|$ is higher than 0.1 (blue dotted lines).

Figure S9: Identification of prominent modes in $K^{RP}$ of polyG and $K^{RP}$ of polyGc. (a) The distributions of the averaged mean-mode content $r^{RP}_\alpha$ and eigenvalue $\lambda^{RP}_\alpha$, where $\alpha$ is the mode index. Statistical outliers of the $\lambda^{RP}_\alpha$ distribution ($\lambda^{RP}_\alpha > 1.5$ IQR, red dotted line) that also have high mean-mode content ($r^{RP}_\alpha > 0.8$, green dotted line) are identified as the prominent modes of $K^{RP}$ (orange dots). (b) For each atom in polyG and polyGc, the magnitude of the $\alpha$-th prominent mode, $|\nu^{RP}_\alpha|$, quantifies the importance of the atom for $K^{RP}$. For illustration, $|\nu^{RP}_1|$ of polyG and $|\nu^{RP}_1|$ of polyGc are shown. Each bar chart shows $|\nu^{RP}_1|$ for a specific atom in the ribose-phosphate backbone along the nucleotide index. The high-weight atoms of $K^{RP}$ are those whose $|\nu^{RP}_\alpha|$ is higher than 0.1 (blue dotted lines).

Figure S10: Identification of prominent modes in $K^{RP}$ of TpA and $K^{RP}$ of CpG. (a) The distributions of the averaged mean-mode content $r^{RP}_\alpha$ and eigenvalue $\lambda^{RP}_\alpha$, where $\alpha$ is the mode index. Statistical outliers of the $\lambda^{RP}_\alpha$ distribution ($\lambda^{RP}_\alpha > 1.5$ IQR, red dotted line) that also have high mean-mode content ($r^{RP}_\alpha > 0.8$, green dotted line) are identified as the prominent modes of $K^{RP}$ (orange dots). For each atom in TpA and CpG, the magnitude of the $\alpha$-th prominent mode, $|\nu^{RP}_\alpha|$, quantifies the importance of the atom for $K^{RP}$. For illustration, (b) $|\nu^{RP}_1|$ of TpA and (c) $|\nu^{RP}_1|$ of CpG are shown. Each bar chart shows $|\nu^{RP}_1|$ for a specific atom in the ribose-phosphate backbone along the nucleotide index. The high-weight atoms of $K^{RP}$ are those whose $|\nu^{RP}_\alpha|$ is higher than 0.1 (blue dotted lines).
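The two-threshold selection described in the captions of Figures S1-S10 can be sketched as below. This is an illustrative sketch only: the captions state the outlier criterion as "> 1.5 IQR", and the code assumes the common Tukey upper fence (Q3 + 1.5 IQR), which may differ from the authors' exact rule. The function names are hypothetical, and the mean-mode content is taken as a precomputed input because its definition is not given in this section.

```python
import numpy as np

def prominent_modes(eigvals, mode_content, content_cut=0.8):
    """Indices of modes that are eigenvalue outliers AND have high mean-mode content."""
    eigvals = np.asarray(eigvals, dtype=float)
    mode_content = np.asarray(mode_content, dtype=float)
    q1, q3 = np.percentile(eigvals, [25, 75])
    upper_fence = q3 + 1.5 * (q3 - q1)  # assumption: Tukey upper fence
    return np.flatnonzero((eigvals > upper_fence) & (mode_content > content_cut))

def high_weight_atoms(eigvec, cut=0.1):
    """Indices of atoms whose eigenvector component magnitude exceeds the cutoff."""
    return np.flatnonzero(np.abs(np.asarray(eigvec, dtype=float)) > cut)
```

The default cutoffs of 0.8 and 0.1 are the values quoted in the captions; everything else is a placeholder.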

Figure S11: The distributions of $k^{m}_{a\text{-}b}$ for all types of base step (m = st, RP) or base (m = BR), shown as boxplots. $k^{m}_{a\text{-}b}$ is the average of the corresponding $k^{m}_{ij}$ of a particular atom pair a-b over all base steps or bases. The atom pairs whose $k^{m}_{a\text{-}b}$ is larger than Q3 are recognized as outliers (dots). For m = st and m = RP, the atom pairs with the top three $k^{m}_{a\text{-}b}$ are selected as the most strongly coupled atom pairs (red dots). For m = BR, the atom pairs with the top two $k^{m}_{a\text{-}b}$ are selected as the most strongly coupled atom pairs (red dots).
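The averaging and top-ranked selection described in the caption of Figure S11 can be illustrated with a short sketch. The input layout (a flat list of `(pair_label, k_ij)` tuples), the function name, and the example labels and values in the usage note are hypothetical and serve only to show the procedure.

```python
from collections import defaultdict

def strongest_pairs(springs, top_k):
    """Average k_ij per atom-pair label and return the top_k labels by mean strength.

    springs : iterable of (pair_label, k_ij), one entry per base step or base
    top_k   : 3 for m = st or RP, 2 for m = BR, following the caption
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for label, k in springs:
        sums[label] += k
        counts[label] += 1
    means = {label: sums[label] / counts[label] for label in sums}
    return sorted(means, key=means.get, reverse=True)[:top_k]
```

For example, with the made-up input `[("N1-C2", 2.0), ("N1-C2", 4.0), ("C4-C5", 1.0), ("C6-N3", 5.0)]` and `top_k=2`, the mean strengths are 3.0, 1.0 and 5.0, so the function returns `["C6-N3", "N1-C2"]`.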

Figure S12: Most strongly coupled atom pairs $\{a\text{-}b\}^{st}$. (a) The list of the strengths $k^{st}_{a\text{-}b}$ (kcal/mol/Å²) for $\{a\text{-}b\}^{st}$ of all types of base step in the transcription regulatory sequences. The strength of a particular pair, $k^{st}_{a\text{-}b}$, is the average of the corresponding $k^{st}_{ij}$ over all central base steps. (b) The molecular illustrations of all pairs in $\{a\text{-}b\}^{st}$ (red lines). To compare the base-stacking rigidity with the base-pairing rigidity, the strengths $k^{hb}_{a\text{-}b}$ of the atom pairs in $\{a\text{-}b\}^{hb}$ are also shown (green dotted lines).

Figure S13: Most strongly coupled atom pairs $\{a\text{-}b\}^{BR}$. (a) The list of the strengths $k^{BR}_{a\text{-}b}$ (kcal/mol/Å²) for $\{a\text{-}b\}^{BR}$ of all types of base in the transcription regulatory sequences. The strength of a particular pair, $k^{BR}_{a\text{-}b}$, is the average of the corresponding $k^{BR}_{ij}$ over all central bases. (b) The molecular illustrations of all pairs in $\{a\text{-}b\}^{BR}$ (orange lines). For purines, the common pattern of $\{a\text{-}b\}^{BR}$ is {O4'-C4, O4'-C8}, but the O4'-C8 spring is not included in the $\{a\text{-}b\}^{BR}$ of CpG(G) (green solid line). For pyrimidines, the general pattern of $\{a\text{-}b\}^{BR}$ is {O4'-C2, O4'-C6}, but the O4'-C2 spring is not included in the $\{a\text{-}b\}^{BR}$ of polyAc (green solid line). To show the effect of base pairing on the pattern of BR mechanical hotspots, the atom pairs in $\{a\text{-}b\}^{hb}$ are also shown (green dotted lines).

Figure S14: Most strongly coupled atom pairs $\{a\text{-}b\}^{RP}$. (a) The list of the strengths $k^{RP}_{a\text{-}b}$ (kcal/mol/Å²) for $\{a\text{-}b\}^{RP}$ of all types of base step in the transcription regulatory sequences. The strength of a particular pair, $k^{RP}_{a\text{-}b}$, is the average of the corresponding $k^{RP}_{ij}$ over all central base steps. (b) The molecular illustrations of all pairs in $\{a\text{-}b\}^{RP}$ (magenta lines).

Figure S16: The characterization of base step parameters in the MD simulations of the transcription regulatory sequences.

Figure S17: The characterization of base pair parameters in the MD simulations of the transcription regulatory sequences.

Figure S18: The correlation in coupling strengths (kcal/mol/Å²) between different mechanical compartments in polyA and polyAc. The molecular illustrations of $\{a\text{-}b\}^{m}$ (m = hb, st, BR and RP) are shown in the top panel. (a) $k^{st}_{a\text{-}b,p}$-$k^{hb}_{a\text{-}b,p}$ plots, (b) $k^{BR}_{a\text{-}b,p}$-$k^{st}_{a\text{-}b,p}$ plots and (c) $k^{RP}_{a\text{-}b,p}$-$k^{st}_{a\text{-}b,p}$ plots. For m = hb and m = BR, $k^{m}_{a\text{-}b,p}$ is the average of the force constant over all trajectory windows for the atom pair a-b at base $p$. For m = st and m = RP, $k^{m}_{a\text{-}b,p}$ is the average of the force constant over all trajectory windows for the atom pair a-b whose atom a is at base $p$ and atom b is at base $p + 1$.

Figure S19: The correlation in coupling strengths (kcal/mol/Å²) between different mechanical compartments in polyG and polyGc. The molecular illustrations of $\{a\text{-}b\}^{m}$ (m = hb, st, BR and RP) are shown in the top panel. (a) $k^{st}_{a\text{-}b,p}$-$k^{hb}_{a\text{-}b,p}$ plots, (b) $k^{BR}_{a\text{-}b,p}$-$k^{st}_{a\text{-}b,p}$ plots and (c) $k^{RP}_{a\text{-}b,p}$-$k^{st}_{a\text{-}b,p}$ plots. For m = hb and m = BR, $k^{m}_{a\text{-}b,p}$ is the average of the force constant over all trajectory windows for the atom pair a-b at base $p$. For m = st and m = RP, $k^{m}_{a\text{-}b,p}$ is the average of the force constant over all trajectory windows for the atom pair a-b whose atom a is at base $p$ and atom b is at base $p + 1$.

Figure S20: The correlation in coupling strengths (kcal/mol/Å²) between different mechanical compartments in TpA(AT) and TpA(TA). The molecular illustrations of $\{a\text{-}b\}^{m}$ (m = hb, st, BR and RP) are shown in the top panel. (a) $k^{st}_{a\text{-}b,p}$-$k^{hb}_{a\text{-}b,p}$ plots, (b) $k^{BR}_{a\text{-}b,p}$-$k^{st}_{a\text{-}b,p}$ plots and (c) $k^{RP}_{a\text{-}b,p}$-$k^{st}_{a\text{-}b,p}$ plots. For m = hb and m = BR, $k^{m}_{a\text{-}b,p}$ is the average of the force constant over all trajectory windows for the atom pair a-b at base $p$. For m = st and m = RP, $k^{m}_{a\text{-}b,p}$ is the average of the force constant over all trajectory windows for the atom pair a-b whose atom a is at base $p$ and atom b is at base $p + 1$.

Figure S21: The correlation in coupling strengths (kcal/mol/Å²) between different mechanical compartments in CpG(GC) and CpG(CG). The molecular illustrations of $\{a\text{-}b\}^{m}$ (m = hb, st, BR and RP) are shown in the top panel. (a) $k^{st}_{a\text{-}b,p}$-$k^{hb}_{a\text{-}b,p}$ plots, (b) $k^{BR}_{a\text{-}b,p}$-$k^{st}_{a\text{-}b,p}$ plots and (c) $k^{RP}_{a\text{-}b,p}$-$k^{st}_{a\text{-}b,p}$ plots. For m = hb and m = BR, $k^{m}_{a\text{-}b,p}$ is the average of the force constant over all trajectory windows for the atom pair a-b at base $p$. For m = st and m = RP, $k^{m}_{a\text{-}b,p}$ is the average of the force constant over all trajectory windows for the atom pair a-b whose atom a is at base $p$ and atom b is at base $p + 1$.
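The sign of the relationship shown in plots such as $k^{st}$ versus $k^{hb}$ can be quantified with a Pearson correlation coefficient and a linear least-squares fit. The sketch below is a generic illustration, not the authors' analysis script: the function name is hypothetical, and the inputs are assumed to be per-base averages over trajectory windows, as defined in the captions above.

```python
import numpy as np

def stacking_pairing_trend(k_hb, k_st):
    """Pearson r and least-squares line k_st = slope * k_hb + intercept."""
    k_hb = np.asarray(k_hb, dtype=float)
    k_st = np.asarray(k_st, dtype=float)
    r = np.corrcoef(k_hb, k_st)[0, 1]
    slope, intercept = np.polyfit(k_hb, k_st, 1)
    return float(r), float(slope), float(intercept)
```

A negative `r` and `slope` would correspond to the compensatory behavior in which stronger pairing accompanies weaker stacking; a positive pair would correspond to the opposite trend.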

Figure S23: The $k^{st}_p$-$k^{hb}_p$ plots of polyA, polyAc, polyG, and polyGc (left) and of TpA(AT), TpA(TA), CpG(GC) and CpG(CG) (right). The linear best fit of $k^{st}_p$ to $k^{hb}_p$ is shown for each group of (polyA, polyG), (polyAc, polyGc), (TpA(AT), CpG(GC)), and (TpA(TA), CpG(CG)). The values of $k^{hb}_p$ and $k^{st}_p$ are calculated from 1 µs all-atom MD simulations of all sequences using the OL15 force field. The other details remain the same as in the data processing of Fig. 4b.

Figure S24: The $k^{RP}_p$-$k^{st}_p$ plots of polyA, polyAc, polyG, and polyGc (top) and of TpA(AT), TpA(TA), CpG(GC) and CpG(CG) (bottom). The linear best fit of $k^{RP}_p$ to $k^{st}_p$ is shown for each sequence motif. The values of $k^{st}_p$ and $k^{RP}_p$ are calculated from 1 µs all-atom MD simulations of all sequences using the OL15 force field. The other details remain the same as in the data processing of Fig. 6a (bottom).

Figure S25: The trigram of base-to-backbone rigidity for polyA, polyG, TpA, and CpG in terms of the $k^{hb}$, $k^{st}$ and $k^{RP}$ values in kcal/mol/Å². All mean inter-moiety rigidities are calculated from the 1 µs all-atom MD simulation of each sequence using the OL15 force field. The other details remain the same as in the data processing of Fig. 7.