Dinesh
Sharma
a,
Kopal
Sharma
a,
Akhilesh
Mishra
a,
Priyanka
Siwach
b,
Aditya
Mittal
a and
B.
Jayaram
*ac
aSupercomputing Facility for Bioinformatics & Computational Biology, Kusuma School of Biological Sciences, Indian Institute of Technology, Delhi, India
bDepartment of Biotechnology, Chaudhary Devi Lal University, Sirsa, Haryana, India
cDepartment of Chemistry, Indian Institute of Technology, Delhi, India. E-mail: bjayaram@chemistry.iitd.ac.in
First published on 10th February 2023
Genomes of most organisms on earth are written in a universal language of life, made up of four units – adenine (A), thymine (T), guanine (G), and cytosine (C), and understanding the way they are put together has been a great challenge to date. Multiple efforts have been made to annotate this wonderfully engineered string of DNA using different methods but they lack a universal character. In this article, we have investigated the structural and energetic profiles of both prokaryotes and eukaryotes by considering two essential genomic sites, viz., the transcription start sites (TSS) and exon–intron boundaries. We have characterized these sites by mapping the structural and energy features of DNA obtained from molecular dynamics simulations, which considers all possible trinucleotide and tetranucleotide steps. For DNA, these physicochemical properties show distinct signatures at the TSS and intron–exon boundaries. Our results firmly convey the idea that DNA uses the same dialect for prokaryotes and eukaryotes and that it is worth going beyond sequence-level analyses to physicochemical space to determine the functional destiny of DNA sequences.
Over the years, scientists have been constantly working to develop various computational algorithms for the annotation of different DNA elements. Among the various sites, promoters and exon–intron boundaries are a prime focus of the present research. Promoters are one of the genome's most essential components. These elements commence the transcription process by primarily binding to the RNA polymerase (RNAP). Together with the bound polymerase, these elements undergo several changes in their overall architecture.7 The promoter's function, however, is not restricted to starting transcription. Also, these regions aid in appropriately identifying and confirming predicted genes in genome annotation. Sequence-based methods rely on capturing a consensus sequence around promoters and such approaches have been moderately successful.8,9 Machine learning-based promoter predictors are developed through training over extensive sequence data. These tools offer good accuracy but are highly genome specific.10–16 Despite the many insights gained over the years from such investigations, the creation of a promoter prediction tool capable of high performance across diverse organisms is still a long way off.6
Eukaryotes are complex organisms that have a membrane-bound nucleus and cell organelles. In these organisms, a gene is composed of two essential elements, viz., introns and exons. Introns are the noncoding regions of a gene (or mRNA) that are removed before the maturation of the mRNA through a process called splicing. The segments of the DNA (or mRNA) that finally become part of the mature mRNA and thus code for the proteins are exons.7 Like promoter identification, detecting accurate intron–exon architecture within a gene is essential and has recently received significant attention in eukaryotic genome annotation.17 The primary focus of categorization was on the splice sites (SSs) and junction sequences. Conventionally, it includes a G–T nucleotide pair at the 5′ end of the intron, and an A–G nucleotide pair at the 3′ end.7,17 With the continuous characterization of SSs, many sequence alterations occur at these sites. Further, the identification of the exon–intron junctions becomes challenging due to the presence of alternative splicing. These challenges significantly reduce the reliance on SSs for detecting exon–intron boundaries.18 Computational algorithms have been developed to recognize these junctions within the genes. Some tools use a score to predict the splice site, calculated against matrices created from vast sets of splice sites.19–22 Other approaches, such as the Genscan23 and Genomescan,24 compare the splice site sequence signal (at the intron–exon boundaries) to known protein sequences in an integrated manner. Programming-based tools for ab initio gene prediction like Genomewise,25 Augustus,26 Fgenesh,27 GeneParser,28 GeneID29 are built using advanced computational models like HMM or Dynamic model. These methods learn the characteristics of the huge training sequences and produce good results for a particular species or organism. For a sequence not in the training dataset, their accuracy decreases considerably.
Since there exists no universal model for the identification and characterization of these genomic elements, it is necessary to come up with a chemistry-based approach to annotation. In our previous research communications,6,17,30–34 we have elucidated the importance of the structural and energy-based profiling of the DNA elements. Likewise, other groups have also focused on capturing the structural and energy signals of various regions by considering properties like free energy, A-philicity, curvature, bendability, inter-BP properties, and DNA duplex stability under stress.80–84 Through these structural and energetic parameter analyses, it became clear that there are unique structure profiles and energy profiles for every component within the DNA. These profiles are universal for a particular element and thus can be utilized for their efficient recognition. Though the di-nucleotide-based parameters provide a specific structure and energy description of the genomic sites in consideration, they do not take into account the neighbouring effect of the flanking nucleotides. Thus, we miss out on the adjacency effects. We have utilized higher nucleotide steps in the present study, advancing our previous work.
Compared to our previous research,6,17 which utilizes the X-ray-based dinucleotide data of B-DNAs to profile the TSSs in prokaryotes and exon–intron boundaries in humans, in this present work, we have taken into consideration the neighbouring effects by mapping structural and energetic parameters of all the unique tri and tetra-nucleotide steps.6 The biophysical features utilized in this research have been computed through microsecond-long molecular dynamics (MD) simulations on all the possible tri- and tetranucleotide steps. Unlike the dinucleotide-based study, the total number of B-DNA structures currently available in the Nucleotide Database (NDB) does not describe the instances of all possible tri- and tetranucleotide steps. Atomistic molecular dynamics (MD) simulations used in this study are currently the only way to obtain robust and transferable parameters.35 The profiling distinctly delineates the TSSs in both prokaryotes and eukaryotes and the exon–intron boundaries of all the protein-coding genes in humans. These increased nucleotide steps incorporate neighboring effects from the adjacent nucleotide and have provided us with new insights into the structure and energetics of DNA, different from the dinucleotide-based characteristics. The current study is concerned with the recognition and characterisation of the physicochemical signals at the TSS and exon–intron transitions. Our future aim is to develop a method for predicting these sites; hence, it is currently not viable to benchmark the current prediction algorithms. However, a thorough comparison of the three dimensions, viz., sequence, energy and structure-based analysis made here, will strengthen the importance of using the physicochemical-based approaches as a ubiquitous model for annotation.
For this study, the influence of the neighbouring step has been accounted on the base pair and backbone structure/dynamics, and thus for more accurate calculations, these parameters (nine backbone and four BP-axes) have been mapped over all the possible trinucleotides (64), while the six intra-BP and three energy parameters were considered for all the unique trinucleotides (32). The inter-BP parameters have been computed over all the unique tetra-nucleotide steps (136) (represented in the parameter details file, ESI†).78,79 The numerical values for the structural parameters are extracted from the μ-second long MD simulations, while in-house software is used to calculate the energy parameters. The sequence datasets comprise 16519 and 197356 primary TSS sites for promoter analysis in prokaryotes and eukaryotes, respectively. For a similar characterization of exon–intron in the human genome, we have used ∼0.33 million exon–intron boundaries for the exon start site and exon end site. Users can download the data from: https://www.scfbio-iitd.res.in/Tri_Tetra/data.html. The structural and energy-based descriptions of these DNA elements give us unique signatures for their characterization. To establish the universality of these biophysical signals, a fair comparison at the same sites has been made with a few widely used sequence-based genome annotation approaches (described in the Methods section). Our results firmly convey that the structural and energy signal patterns can undoubtedly be hidden signals within the DNA of both the prokaryotes and eukaryotes through which the genome conveys information about its functional features.
Phylum | Species | Genome size (Mb) | Number of primary TSS | Number of CDS |
---|---|---|---|---|
Euryarchaeota | Methanolobus psychrophilus | 3.07 | 1463 | 355 |
Thermococcus kodakarensis | 2.08 | 1248 | 208 | |
Halofrex volcanii | 3.93 | 1723 | 425 | |
Actinobacteria | Mycobacterium tuberculosis H37Rv | 4.38 | 1440 | 626 |
Streptomyces coelicolor A3 | 9.05 | 2771 | 1201 | |
Proteobacteria | Helicobacter pylori | 1.63 | 227 | 227 |
Salmonella enterica serovar Typhimurium | 5.067 | 1871 | 624 | |
Escherichia coli | 5.17 | 1222 | 577 | |
Pseudomonas aeruginosa PA14 | 6.58 | 2118 | 853 | |
Firmicutes | Bacillus amyloliquefaciens | 3.95 | 1062 | 393 |
Chlamydiae | Chlamydia pneumonia CWL029 | 1.22 | 357 | 198 |
Cyanobacteria | Synechocystis sp. PCC6803 | 3.57 | 430 | 531 |
Total | 16519 | 6218 |
Group | Species | Chromosomes | Number of TSS | Number of CDS |
---|---|---|---|---|
Budding yeast | Candida albicans | Chr1, Chr2, Chr3, Chr4, Chr5, Chr6, Chr7 and ChrR | 26545 | 2329 |
Kluyveromyces lactis | ChrA, ChrB, ChrC, ChrD, ChrE and ChrF | 34655 | 1910 | |
Lachancea kluyveri | ChrA, ChrB, ChrC, ChrD, ChrE, ChrF, ChrG and ChrH | 17411 | 2086 | |
Naumovozyma castellii | Chr1, Chr2, Chr3, Chr4, Chr5, Chr6, Chr7, Chr8, Chr9 and Chr10 | 19189 | 2144 | |
Saccharomyces cerevisiae | ChrI, ChrII, ChrIII, ChrIV, ChrV, ChrVI, ChrVII, ChrVIII, ChrIX, ChrX, ChrXI, ChrXII, ChrXIII, ChrXIV, ChrXV and ChrXVI | 17925 | 2223 | |
Saccharomyces paradoxus | ChrI, ChrII, ChrIII, ChrIV, ChrV, ChrVI, ChrVII, ChrVIII, ChrIX, ChrX, ChrXI, ChrXII, ChrXIII, ChrXIV, ChrXV and ChrXVI | 29690 | 2184 | |
Yarrowia lipolytica | ChrA, ChrB, ChrC, ChrD, ChrE and ChrF | 27793 | 2365 | |
Fission yeast | Schizosaccharomyces pombe | ChrI, ChrII and ChrIII | 24148 | 1474 |
Total | 197356 | 16715 |
To characterize the intron–exon boundary junctions, the human genome annotation file was retrieved from the GENCODE database. From this file, the start and end positions of 328368 exons from all the protein-coding genes were considered. A series of exon–intron boundary sequence datasets were constructed using these genomic locations. For the sequence study, two datasets, each containing sequences of 51-nucleotide length, were created using the exon-start and exon-end, placed at the 26th position, respectively. These datasets were organised based on the location of the genes on chromosomes (Table S2, ESI†) and were then used to identify a consensus sequence around the positions of concern. Different from the sequence datasets, two new datasets were created for the structural and energy-based characterization of the exon start and exon end regions, each containing 328368 exon–intron boundary sequences. Dataset I had sequences of 401 nucleotides in length; these sequences were created by extracting 200 nucleotides upstream (representing the exon sequence) and downstream (representing the intron sequence) from the exon end position placed at 0. Similar to Dataset I, Dataset II was generated by considering the exon start positions; these sequences represent intron and exon sequences from −200 to −1 and +1 to +200, respectively. For the control dataset, we extracted sequences with similar lengths from the middle region of the exons that were longer than 1000 nucleotides (30140 out of 328368).
All relevant sequence datasets were analyzed to obtain the structural and energy profiles corresponding to the TSS in prokaryotes and eukaryotes and intron–exon interfaces in the human genome. These datasets include the 1001 nucleotide long sequences containing TSS in prokaryotes and eukaryotes and the 401 nucleotide long sequences of Dataset I and Dataset II obtained from the exon-end and exon-start sites, respectively. A sliding window of one nucleotide was used to cover the whole sequence. The values of the trinucleotide and tetra-nucleotide parameters were utilized at each sliding window to convert the sequence into 28 numerical series. These numerical profiles of all sequences relating to a specific parameter were averaged for each position after applying a sliding window of 25 base pairs. A similar methodology was applied to the control sequences (CDS) in each category. The resulting 28 averaged profiles (corresponding to each parameter) were then used to plot and visualize the structural and energetic changes occurring around the TSS (in prokaryotes and eukaryotes) and exon–intron boundary sites.
For both prokaryotes and eukaryotes, respective sequence datasets containing TSS were considered for each parameter. Based on the location of the pattern observed in the average plot of a particular parameter, positive (TSS) and negative (CDS) vectors were created from the individual sequences. For the positive vector, a window of 100 spanning −50 to +50 with TSS at 0 was considered. For the negative vector, a similar-sized window was taken after traversing 200 nucleotides downstream of the TSS.
In parallel, for a threshold analysis on the intron–exon boundaries, discrete sequences of the exon-start and exon-end datasets for each parameter were used. A 60-nucleotide long vector spanning −30 to +30 was extracted around the exon start and end positions to create the positive set for the respective group. To create a negative set corresponding to each position and sequence within both exon-start and exon-end datasets, we traversed 150 nucleotides from the positive position towards the exon region and extracted a similar length vector. For each pair of positive and negative sets, the area enclosed between them was compared with different threshold values for each parameter in the above-mentioned respective datasets. The idea is that those pairs in which the area confined within the vectors are small, i.e., less than two standard deviations from the mean, and will have indistinguishable signals but the remaining pairs will have distinct profiles with significant magnitudes. As a result, signals that fulfil the two standard deviation threshold criteria indicate that a TSS signal or exon–intron boundary signal is present in those specific sequences. The area under the curve calculations and the evaluation of thresholds are specified in the methodology in S1a and S1b and in Fig. S12–S15 (ESI†).
Fig. 1 Sequence consensus for Helicobacter pylori, a prokaryote (as obtained using WebLogo software), from position −95 to +5 with respect to TSS at 0. Some consensus is evident at position −10. However, on comparing the sequence consensus from all 12 organisms (ESI†), no specific pattern emerges. |
Our Lab has been investigating the energetics of DNA for the past 20 years, and we have demonstrated the importance of solvation, stacking, and hydrogen bond energy in identifying the distinct genomic functional units.30–34,52,68,69 Apart from the energy parameters, in recent years, we have explored the complete structural profile of DNA in and around various important motifs by taking into consideration the individual parameters belonging to four major DNA structural categories.6,17,30 For each energy and structural parameter, individual TSS and CDS sequences from respective organisms were converted to their numerical profiles; these were then averaged at each position and were used for plotting (Fig. 2 and 3). Fig. 2 shows that for all the parameters, a unique pattern is observed around the TSS (present at the 501th position and marked as 0 on the abscissa in the graphs) in comparison to the CDS and the extreme upstream and downstream regions of the TSS sequences for Helicobacter pylori (results for the remaining species are available in Fig. S2, ESI†). Fig. 3 is the combined graph for the structural and energy patterns obtained for all the sequences from all 12 organisms. These results support the scientific theories stating that the overall stability of DNA sequences, both evolutionary and thermodynamic, decreases near the promoter region.30,70–74 The energy profiles in Fig. 2 and 3 show that the hydrogen bond energy is increasing, thereby reducing the melting temperature of DNA at the TSS site. The graph of stacking energy shows that the DNA at the TSS region becomes stiffer to provide a stable platform for RNA polymerase binding. The solvation energy is reduced as DNA makes a tertiary structure near the TSS region. A similar trend is observed for the structural parameters while proceeding toward the TSS region. Out of the total 25 structural parameters, an overall rise at or around the TSS is noticeable for the 10 parameters (Alpha, Gamma, Zeta, Amplitude, Twist, Shear, Opening, X-displacement, and Tip). The other 15 properties (Beta, Delta, Epsilon, Chi, Phase, Shift, Slide, Tilt, Roll, Stagger, Buckle, Propel, Y-displacement, and Inclination) show a decrease in their pattern at the TSS region. Different from our previous study related to di-nucleotides (Fig. S3, ESI†), here, 11 parameters show a different pattern of values upstream and downstream of the TSS, which may be a result of the neighbouring effect, which was not pronounced in the dinucleotide motif dataset; this may be helpful in better predicting the TSS sites.6
Fig. 2 Structure and energy profiles of Helicobacter pylori, a prokaryote. TSS sequences are shown in green, whereas red lines represent CDSs. The numeric value of the parameter is represented by the ordinate, while the nucleotide position relative to the TSS is shown by the abscissa. Parameters are represented in the parameter details file (ESI†). The correlation among the parameters (for the entire dataset) and the important features are present in the ESI.† |
Fig. 3 Normalised and combined structure and energy graph for all sequences (16519) containing TSS from 12 prokaryotic organisms. |
To strengthen the notion, we performed a similar sequence and biophysical profiling of the promoter sequences from eukaryotes. For the analysis, we considered 7 budding yeasts and 1 fission yeast. A methodology similar to the prokaryotic promoter characterization was followed. Considering the TSS position at 0, to obtain the consensus, we extracted 101 nucleotide length sequences from position −80 to +20. The positions here are based on the promoter architecture in yeast75 and thus differ from what we considered for the prokaryotes. Using the Weblogo3 software, we obtained the consensus for Kluyveromyces lactis (Fig. 4, consensus for the rest of the species is in Fig. S4, ESI†). These figures show little or no consensus at the promoter. Just like the prokaryotes, some consensus does occur at the organism level and fades out when all the sequences are considered. The consensus sequence-based approach is thus organism-specific and fails to establish a universal signal for promoter profiling. Structural and energy-based characterization of the eukaryotic promoter sequences reveals distinguishing patterns around the TSS in all species (Fig. 5, 6 and Fig. S5, ESI†). Fig. 5 depicts the structure and energy profile of Kluyveromyces lactis, while Fig. 6 shows the trend specific to each parameter, mapped over the combined TSS sequences and compared to the CDS sequences from all the organisms. For each parameter, a specific pattern is evident at the TSS. A sharp increasing or decreasing signal at the TSS following a low intense signal around −300 to −400 can be seen for each feature. It can be hypothesized that these characteristic signals are due to the presence of important motifs prior to the actual promoter; however, a thorough investigation is essential to validate the finding. Comparing the signals from prokaryotes with those of eukaryotes suggests that the trend for each parameter is not the same. Here, 11 structural parameters (Alpha, Beta, Epsilon, Zeta, Amplitude, Slide, Twist, Shear, Stretch, Opening, and X-displacement) show decreasing behaviour at the TSS. An increasing nature at the TSS is evident for the rest of the 14 parameters (Gamma, Delta, Chi, Phase, Shift, Rise, Tilt, Roll, Stagger, Buckle, Propel, Y-displacement, Inclination, and Tip). This different behaviour of parameters can be attributed to the complex nature of eukaryotic promoters, which in the case of prokaryotes, have minimal motifs and regulatory elements; further investigation is necessary to comment on this contrasting behaviour of parameters. With the idea that a signal is said to be indispensable for usage only if it is captured at the basal level, we characterized the sequences on the chromosome level. Fig. S6 (ESI†) indicates that the structural and energy signals are also evident at the chromosome level in all the yeast species. Since yeasts are simple unicellular eukaryotic organisms that form a connecting link between the prokaryotes and the higher eukaryotes, it is necessary to characterize a multicellular higher eukaryote to see the pertinence of our approach. Caenorhabditis elegans is a common eukaryotic multicellular experimental model in biology.76 For the structural and energy characterization of this multicellular eukaryote, we obtained the potential TSS sites for all the chromosomes of Caenorhabditis elegans77 (TSS site information is in Table S7, ESI†). The CDS sites were also retrieved from the respective genome annotation files. Following the biophysical-based prokaryotic and yeast promoter characterization methodology, these sites were used for obtaining the 1001 nucleotide length sequences. The structural and energy profiling at the TSS of Caenorhabditis elegans for the combined sequences and chromosome sequences are presented in Fig. 7 and Fig. S7, S8 (ESI†). The signals are very smooth and the parameter plots reveal that the structural and energy profiles change at the TSS. The weak intensities of the signal can be attributed to the complexity of the higher eukaryotic promoter sites or the lower TSS sites known and considered for the study. However, the presence of a pattern corresponding to each parameter at the TSS sites indicates the robustness of the characterization approach. Based on the various results from the promoter characterization in both prokaryotes and eukaryotes, the structural and energy-based characterization approach outperformed the consensus sequence-based approaches, which tend to be organism-specific and non-universal. The promoter characterization for both the prokaryotes and the eukaryotes delineates the fact that several sequence variations occur at the TSS sites, and these cannot be followed for their efficient recognition. Despite sequence alterations at these sites, the physicochemical profiles for these regions are conserved and can be exploited for comprehensive annotations. Since the figures presented so far are the average plots (for all sequences throughout all organisms in prokaryotes and eukaryotes, all sequences at the organism level in prokaryotes and eukaryotes, and all sequences at the chromosome level in the case of eukaryotes only), it is crucial to know the comprehensiveness and sensitivity of this study on individual sequences. On applying the threshold methodology as explained in the Methods section, it was observed that for the specified position, where the trend was evident in the averaged plots, >98% of sequences in both prokaryotes and eukaryotes followed a similar trend. These results are detailed in Tables S3 and S4 and Fig. S12 and S13 (ESI†), with the graphs depicting the area enclosed for each pair of TSS and CDS vectors from prokaryotes and eukaryotes (a distribution plot of the area calculated for each pair of TSS vector and CDS vector is infeasible given the large volume of data).
Fig. 4 Sequence consensus for Kluvveromyces lactis, a eukaryote (using WEbLogo software), from position −80 to +20 with respect to the TSS at 0. No specific consensus is evident around the TSS for any species (ESI†). |
Fig. 5 Structure and energy profile of Kluyveromyces lactis, a eukaryote. TSS sequences are shown in green, whereas red lines represent CDSs. The numeric value of the parameter is represented by the ordinate, while the nucleotide position relative to TSS is shown by the abscissa. Parameters are represented in the parameter details file (ESI†). The correlation among the parameters (for the entire dataset) and the important features are present in the ESI.† |
Fig. 6 Normalised and combined structure and energy graph for all sequences (197356) containing TSS from the 8 eukaryotic species. |
Fig. 7 Normalised and combined structure and energy graph for all sequences (5561) containing TSS from Caenorhabditis elegans. |
Using the Jalview software and the annotation dataset acquired from GENCODE, we conducted a consensus analysis of 328365 human exon start/end sites. To achieve a particular sequence consensus, we characterized the exon start and exon end regions of a randomly selected gene from each chromosome. The results of the consensus analysis are presented in Fig. S8 (ESI†). From the results at the exon start and exon end, it is evident that there is a consensus, most likely in the form of a trimer or pentamer unit. To reveal these consensuses, we carried out trimer and pentamer motif frequency analyses for the 51-nucleotide-long sequence dataset by placing the exon start/end site at the 26th position (considered as the 0th position for reference).
For the trimer motif analysis, we considered −1, 0, +1 and −2, −1, 0 positions and for the pentamer −2, −1, 0, 1, 2 and −4, −3, −2, −1, 0 positions were selected. From the frequency analysis of trinucleotide motifs (Fig. 8), the occurrences of TAA and TGA were higher at the −1, 0, and +1 positions than any other trimer, while GTA and GTG were higher at the −2, −1, and 0 positions of the exon-end sites, confirming the conventional wisdom of the occurrence of these nucleotides at the said sites.7–17 These motifs undoubtedly have the potential to aid in the prediction of exon-end sites. However, taking into consideration the total size (328364) of the data, the frequency of occurrence is not high, and further, there is no consistent motif to rely on. Contrary to this, trinucleotide motif-based exon-start site analysis at position −1, 0, 1 clearly showed a high occurrence of GGT trimer, while at position −2, −1, 0, there was a higher occurrence of the AGG trimer motif; the frequency, however, is still low considering the size of the data. A similar frequency analysis was performed at both sites for two pentanucleotide motifs with positions −2, −1, 0, 1, 2 and −4, −3, −2, −1, 0. The results show (Fig. 9) that the motif AGGTA (−4, −3, −2, −1, 0) has a higher frequency, and is better for predicting the exon end sites. At the exon start site, there was again a higher occurrence of the AGGTA pentanucleotide for the motif −2, −1, 0, 1, 2, facilitating the prediction of the concerned sites and confirming the conventional belief of the occurrence of the A–G nucleotide at the 3′ end of the intron.7–17 To gain deeper insight into the nucleotide sequence pattern frequency at the exon–intron boundaries, an undecamer position motif extending from −5 to +5, with the exon start/end site positioned at 0, was considered (Fig. 10). The probability of the occurrence of A, T, G and C at each position was obtained. We found that G, G, and A were present with more than 50% probability at positions −4, 0 and 1, respectively, at the exon start. At the exon end, at only the −2 position, we had a greater than 50% occurrence of G. From the sequence-based analysis, various trinucleotide and pentanucleotide motifs occur frequently at the exon-start and exon-end regions. However, there was no specific consensus at these sites, and the occurrence is not uniform at the exon-start and exon-end sites. Since we could not find specific motif sequences with an absolute high consensus, we extended our study to the structural and energetic profiling of the exon–intron boundary elements by considering two separate datasets for exon start and exon end positions, respectively.
Using the trinucleotide and tetranucleotide parameter table, the 328368-nucleotide sequences in both the exon-start position dataset and the exon-end position dataset, just like the promoter sequences, were converted to 25 structural and 3 energy numerical strings. For each parameter, these numerical profiles were averaged over all sequences and plotted. From the structure and energy description provided in Fig. 11 and 12 for exon start and exon end, respectively, the hydrogen bond energy showed a sudden increase in its value followed by a crest. It can be inferred from this trend that the structure at the boundary position initially became unstable and then restores its balance with the passage of the start/end sites. The stacking energy showed a peak at the boundary junction, thus decreasing the stiffness of DNA and making it more flexible. A sharp drop in solvation energy implies the formation of a secondary structure at the exon–intron boundaries. Contrary to these observations in the energy parameters, the middle portion of the exons (in red) has a constant energy profile throughout their lengths. These energy changes at the exon–intron junctions support the notion that the boundary elements are involved in the formation of a secondary structure in the RNA space, which facilitates the splicing event, and the DNA analyses here anticipate these events.
Fig. 11 Structure and energy profiles at exon start sites (for all 328364 sequences). Sequences containing exon start sites are shown in green, whereas red lines represent CDSs. The numeric value of the parameter is represented by the ordinate, while the nucleotide position relative to the exon start is shown by the abscissa. Parameters are presented in the parameter details file (ESI†). The correlation among the parameters (for the entire dataset) and the important features are presented in the ESI.† |
Fig. 12 Structure and energy profiles at exon end sites (for all 328364 sequences). Sequences containing exon ends are shown in green, whereas red lines represent CDSs. The numeric value of the parameter is represented by the ordinate, while the nucleotide position relative to the exon end is shown by the abscissa. Parameters are represented in the parameter details file (ESI†). The correlation among the parameters (for the entire dataset) and the important features are presented in ESI.† |
Since energy changes are not entirely responsible for bringing out the changes occurring at various sites in the genome, it is their union with the structural features that bring about any functional changes within the DNA. The structural profile of the exon start/end sites also supports the notion of the formation of a secondary structure. From the plots, it is evident that for all the parameters, there is a signature pattern at the exon start and end sites (in green) in contrast to the middle region of exons (represented in red), which mostly have a constant nature. Compared to the gradual rise and fall, as seen for most of the parameters near TSS regions, a combination of sharp peaks and clefts is observed at the intron–exon boundaries. For the 13 parameters (Alpha, Beta, Gamma, Chi, Slide, Shear, Buckle, Opening, Y-displacement, Inclination, Twist, Stretch, and X-displacement), a peak followed by a crest is evident from the sequences containing exon–intron boundaries and a crest followed by a peak for the remaining 12 parameters (Delta, Epsilon, Zeta, Phase, Amplitude, Tilt, Stagger, Propel, Tip, Rise, Roll, and Shift). An absolute rise and fall were not observed for any of the parameters. The adjacency effect is evident for some of the parameters on comparing the exon start and exon end profiles with the dinucleotide-based profiles (Fig. S11, ESI†). These observations highlight the fact that the overall structural change is sudden at the boundary element of the exon and intron, and this change is not carried over larger distances. The observation made from all the physicochemical profiles emphasizes the fact that the intron–exon boundary elements are solely involved in the splicing event, and the structural and energy changes occurring in these regions facilitate the event. From all the graphs, it is apparent that the patterns of these parameters are highly similar at both the exon start and end sites. The threshold analysis results for the exon–intron and intron–exon junctions are presented in Tables S5, S6 and Fig. S13, S14 (ESI†). The analysis shows that for each parameter, the signal is evident for >98% of sequences and thus supports the notion that the signals are universal.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2cp04820e |
This journal is © the Owner Societies 2023 |