Open Access Article
Archita Agrawal and Alan Saghatelian*
Clayton Foundation Laboratories for Peptide Biology, Salk Institute for Biological Studies, La Jolla, CA, USA. E-mail: asaghatelian@salk.edu
First published on 6th March 2025
Microproteins are an emerging class of proteins of 100 amino acids or fewer that are encoded by small open reading frames (smORFs). The functions of several microproteins have been illuminated through phenotypic screening or protein–protein interaction studies, but thousands of microproteins remain uncharacterized. The functional characterization of microproteins is challenging because they lack sequence homology to characterized proteins. Here, we demonstrate a strategy that enriches for microproteins containing specific motifs as a means to characterize microproteins more rapidly. Specifically, we used the association of polyalanine motifs with nuclear proteins to select 58 candidate microproteins to screen for transactivation function. We identified three microproteins with transactivation activity when tested as GAL4-fusions in a cell-based luciferase assay. The results support the continued use of the motif selection strategy for the discovery of microprotein function.
The development of ribosome profiling,1,6,7 computational approaches to analyze the genome,2 and proteomics8,9 paved the way for the identification of thousands of previously unknown smORFs as potentially protein-coding.4,10,11 One of the first notable examples of functional characterization of a microprotein came with the discovery of polished rice/tarsal-less (pri/tal) where an 11 amino-acid microprotein was found to be critical for development in flies.12,13 A key example in mammalian biology is myoregulin, an endoplasmic reticulum membrane microprotein that regulates calcium flux to mediate muscle performance.14 As evidence began to mount that smORFs are detected across all living organisms, the functional roles of several microproteins were individually elucidated in a broad range of biological processes.1,15 Consequently, the next big challenge presented itself as how to uncover the possible biological functions of thousands of such uncharacterized microproteins.
The guiding principle of protein-function discovery relies on sequence conservation, which signifies biological importance and provides a starting point for experimental studies. As a class of proteins, microproteins are short sequences and do not align readily with high sequence conservation scores3,11,16 when queried against the canonical proteome databases. Due to these constraints, predicting and analyzing their probable structural folds at scale with current computational methods is challenging. Therefore, studies to experimentally validate the possible biological activities of microproteins have been restricted to select candidates, often those that show strict evolutionary conservation of their protein sequences.
It is critical to note that eukaryotic proteomes are rich in protein sequences that lack ordered three-dimensional structural folds or domains, yet possess important molecular functions. Intrinsically disordered regions (IDRs) represent 30–40% of the eukaryotic proteomes.17,18 IDRs exhibit diverse functionality: they can serve as spacers or flexible sequences residing between the folded domains of a larger protein, or possess regulatory functions such as molecular recognition, scaffolding, and biomolecular condensate formation.17 As opposed to the well-folded protein domains that mediate functions such as catalysis and molecular binding, IDRs offer functional advantages such as structural adaptability or induced folding and weak but specific interactions, and they are often sites for post-translational modifications.17,18 A hallmark feature of IDRs is that they show weaker sequence conservation than sequences that adopt stable tertiary structures.17
Similar to such IDRs, it is postulated that microproteins may also comprise functional sequences despite lacking conserved sequences and definitive structural folds.16 The CYREN microprotein is disordered, and immunoprecipitation studies have shown that CYREN can regulate its protein–protein interactions with Ku70/80 and several DNA damage response proteins through distinct short linear motifs (SLiMs).19,20 CYREN localizes to the nucleus and regulates the classical non-homologous end joining pathway.20,21 NBDY is an intrinsically disordered microprotein that can modulate the phase separation properties of P-bodies through its phosphorylation status.22 These case studies illustrate how intrinsically disordered microproteins can carry out their molecular functions through motifs and/or post-translational modifications, and it is plausible that the biochemical principles that guide IDR functionality can be applied to microprotein sequences of unknown function to support their characterization.
We hypothesized that even though thousands of microprotein sequences might not be evolutionarily conserved across their full length, the appearance of specific sequence motifs could predispose them to certain biochemical and cellular activities. If true, this would encourage motif searches across the available microprotein databases to enable the design of focused biofunctional screens, and accelerate the functional characterization of microproteins.
Fig. 1 (A) Amino acid composition of microproteins (n = 7655, microprotein database23) relative to the reference human proteome (n = 20 435, UniProtKB); the numerical values are listed in Table S1 (ESI†). (B) GO overrepresentation test for the polyalanine protein subset (n = 415) relative to the reference human proteome. Information relating to the polyalanine subset and GO analysis is presented in Tables S2 and S3 (ESI†).
Since microproteins contain a higher proportion of alanine residues, we further examined the role and relevance of polyalanine motifs. We reasoned that a run of six or more consecutive alanine residues (the 6-ala peptide motif) would rarely arise in a protein by chance, so searching for this motif should isolate sequences that are locally rich in alanine. We found 415 proteins that contain the 6-ala peptide motif in the canonical human protein dataset (Table S2, ESI†). Gene ontology analysis showed transcription regulators as the most overrepresented protein class in the polyalanine protein subset (Fig. 1B and Table S3, ESI†), as well as positive enrichment for the nuclear cellular component (Table S3, ESI†). Additionally, we examined the database of human transcription factors compiled by Lambert et al. (2018),29 and 9% of the sequences show a 6-ala motif, a four-fold increase over the expected frequency based on the reference proteome. These observations are consistent with reports in the literature that polyalanine stretches are often found in transcription regulatory proteins.24,25,27,30 We therefore asked whether selecting microprotein sequences that contain the 6-ala peptide motif could enable the discovery of microproteins with transactivation function.
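The motif selection and the enrichment estimate described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the toy sequences, the function names, and the assumed background alanine frequency of 0.07 are hypothetical, and only the counts 415 and 20 435 and the 9% figure are taken from the text.

```python
import re

# Six or more consecutive alanines (the "6-ala" motif described in the text).
POLY_ALA = re.compile(r"A{6,}")

def has_six_ala(sequence: str) -> bool:
    """Return True if the sequence contains at least six consecutive alanines."""
    return POLY_ALA.search(sequence.upper()) is not None

def naive_run_probability(p_ala: float, run_length: int = 6) -> float:
    """Chance that one fixed window of `run_length` residues is all alanine,
    assuming independent residues with alanine frequency `p_ala`."""
    return p_ala ** run_length

def fold_enrichment(subset_fraction: float, background_hits: int, background_total: int) -> float:
    """Fraction of motif-containing sequences in a subset divided by the background fraction."""
    return subset_fraction / (background_hits / background_total)

# Hypothetical toy sequences, NOT taken from the microprotein database.
toy_sequences = {
    "toy_1": "MSTAAAAAAGLW",    # run of six alanines
    "toy_2": "MAAAAAKAAAAL",    # runs of five and four only
    "toy_3": "MKLVAAAAAAAAR",   # run of eight alanines
}
hits = [name for name, seq in toy_sequences.items() if has_six_ala(seq)]

# Assumed background alanine frequency (illustrative value, not from Table S1).
per_window = naive_run_probability(0.07)

# 9% of transcription factors versus 415 of 20 435 reference proteins (from the text).
enrichment = fold_enrichment(0.09, 415, 20435)

print(hits, per_window, round(enrichment, 1))
```

Under these assumptions the background fraction is about 2%, so a 9% frequency among transcription factors corresponds to roughly the four-fold increase quoted in the text.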
The microproteins of interest were sequentially named as microprotein transactivator in this study (MPTA-1 to MPTA-58, Table S4, ESI†), and were cloned as C-terminal fusions of the GAL4 DNA-binding domain (GAL4DBD) into the GAL4-vector (Fig. 2). The GAL4-vector contained Renilla luciferase to serve as an internal transfection control. Additionally, estrogen receptor-α (ERα) and p65 transactivation domain 1 (p65_TA1) were included in the experiments as positive controls. The test library was individually co-transfected with the firefly reporter vector containing an upstream 9xUAS element in HEK293T cells (Fig. 2). The DualGlo luciferase assay was used to measure the induction of firefly luciferase by each MPTA relative to the baseline GAL4-activity. The sequences with four-fold or higher induction relative to GAL4-activity in two independent experiments were identified as positive hits (Table 1 and Table S5, ESI†). This screen led to the identification of three microproteins with transactivation activity, MPTA-10 (30 aa), MPTA-17 (37 aa) and MPTA-45 (98 aa).
Fig. 2 Experimental design to screen polyalanine microproteins for transactivation activity in a cell-based luciferase assay (created with BioRender.com). Microproteins (MPTA-1 to MPTA-58) were synthesized as C-terminal fusions to GAL4DBD and the vector contained Renilla luciferase as an internal transfection control. The gene expression reporter vector contained a firefly luciferase with an upstream 9xUAS element for the assessment of transactivation activity. HEK293T cells were co-transfected with a test microprotein library plasmid and the firefly reporter vector, and 48 hours later the DualGlo luciferase assay was performed. The sequences are listed in Table S4 (ESI†).

| Microprotein | Replicate 1 | Replicate 2 | Microprotein | Replicate 1 | Replicate 2 |
|---|---|---|---|---|---|
| MPTA-1 | 1.29 ± 0.07 | 2.51 ± 0.46 | MPTA-30 | 1.36 ± 0.17 | 0.17 ± 0.07 |
| MPTA-2 | 1.08 ± 0.10 | 1.17 ± 0.10 | MPTA-31 | 1.05 ± 0.11 | 0.66 ± 0.18 |
| MPTA-3 | 0.94 ± 0.14 | 0.73 ± 0.07 | MPTA-32 | 3.29 ± 0.53 | 0.58 ± 0.04 |
| MPTA-4 | 0.51 ± 0.23 | 0.32 ± 0.06 | MPTA-33 | 0.70 ± 0.22 | 1.50 ± 0.20 |
| MPTA-5 | 1.43 ± 0.33 | 1.07 ± 0.06 | MPTA-34 | 0.60 ± 0.12 | 0.31 ± 0.08 |
| MPTA-6 | 0.93 ± 0.38 | 0.60 ± 0.08 | MPTA-35 | 2.92 ± 0.66 | 1.73 ± 0.16 |
| MPTA-7 | 1.04 ± 0.30 | 0.40 ± 0.12 | MPTA-36 | 0.51 ± 0.08 | 0.32 ± 0.04 |
| MPTA-8 | 0.97 ± 0.19 | 0.83 ± 0.19 | MPTA-37 | 0.34 ± 0.05 | 0.47 ± 0.11 |
| MPTA-9 | 2.28 ± 0.31 | 1.16 ± 0.19 | MPTA-38 | 0.65 ± 0.30 | 2.22 ± 0.62 |
| MPTA-10 | 5.41 ± 0.30 | 4.00 ± 0.29 | MPTA-39 | 0.95 ± 0.15 | 0.31 ± 0.04 |
| MPTA-11 | 0.42 ± 0.07 | 0.47 ± 0.06 | MPTA-40 | 1.66 ± 0.36 | 3.27 ± 0.28 |
| MPTA-12 | 0.69 ± 0.15 | 0.10 ± 0.02 | MPTA-41 | 0.92 ± 0.30 | 0.26 ± 0.06 |
| MPTA-13 | 2.60 ± 0.52 | 3.41 ± 0.22 | MPTA-42 | 1.10 ± 0.27 | 0.26 ± 0.04 |
| MPTA-14 | 1.92 ± 0.20 | 1.59 ± 0.13 | MPTA-43 | 2.42 ± 0.54 | 2.41 ± 0.63 |
| MPTA-15 | 1.90 ± 0.26 | 0.53 ± 0.13 | MPTA-44 | 0.55 ± 0.06 | 0.56 ± 0.20 |
| MPTA-16 | 1.49 ± 0.49 | 0.63 ± 0.11 | MPTA-45 | 14.68 ± 2.65 | 9.05 ± 0.91 |
| MPTA-17 | 16.73 ± 4.38 | 12.39 ± 2.01 | MPTA-46 | 0.35 ± 0.10 | 0.19 ± 0.01 |
| MPTA-18 | 1.51 ± 0.37 | 0.28 ± 0.08 | MPTA-47 | 0.20 ± 0.05 | 0.08 ± 0.01 |
| MPTA-19 | 3.17 ± 0.82 | 1.30 ± 0.06 | MPTA-48 | 0.27 ± 0.04 | 1.16 ± 0.31 |
| MPTA-20 | 1.84 ± 0.38 | 0.73 ± 0.06 | MPTA-49 | 0.16 ± 0.04 | 2.37 ± 0.17 |
| MPTA-21 | 2.24 ± 0.21 | 0.57 ± 0.11 | MPTA-50 | 0.30 ± 0.07 | 0.25 ± 0.01 |
| MPTA-22 | 0.60 ± 0.06 | 0.17 ± 0.03 | MPTA-51 | 0.12 ± 0.06 | 0.38 ± 0.11 |
| MPTA-23 | 1.87 ± 0.13 | 0.87 ± 0.10 | MPTA-52 | 0.15 ± 0.02 | 0.22 ± 0.04 |
| MPTA-24 | 1.52 ± 0.61 | 1.83 ± 0.37 | MPTA-53 | 0.51 ± 0.09 | 0.43 ± 0.09 |
| MPTA-25 | 3.22 ± 0.62 | 1.08 ± 0.17 | MPTA-54 | 0.83 ± 0.12 | 0.36 ± 0.07 |
| MPTA-26 | 0.15 ± 0.05 | 0.21 ± 0.09 | MPTA-55 | 0.28 ± 0.05 | 0.34 ± 0.06 |
| MPTA-27 | 0.32 ± 0.06 | 0.29 ± 0.09 | MPTA-56 | 0.83 ± 0.20 | 0.20 ± 0.07 |
| MPTA-28 | 1.33 ± 0.18 | 0.26 ± 0.03 | MPTA-57 | 1.55 ± 0.25 | 0.71 ± 0.44 |
| MPTA-29 | 1.49 ± 0.88 | 0.17 ± 0.02 | MPTA-58 | 1.01 ± 0.29 | 0.78 ± 0.22 |
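The hit criterion stated in the text (four-fold or higher induction relative to GAL4 in both independent experiments) can be applied to the tabulated values as a simple filter. The sketch below copies the mean fold-induction values for a handful of rows from the table above; the function name `call_hits` and the choice of rows are illustrative, and the ± SD values are ignored for simplicity.

```python
# Mean fold induction (replicate 1, replicate 2) copied from selected rows of the table.
fold_induction = {
    "MPTA-10": (5.41, 4.00),
    "MPTA-13": (2.60, 3.41),
    "MPTA-17": (16.73, 12.39),
    "MPTA-32": (3.29, 0.58),
    "MPTA-40": (1.66, 3.27),
    "MPTA-45": (14.68, 9.05),
}

def call_hits(data, threshold=4.0):
    """Return the names whose fold induction meets the threshold in every replicate."""
    return sorted(
        name for name, replicates in data.items()
        if all(value >= threshold for value in replicates)
    )

hits = call_hits(fold_induction)

# Hit rate relative to the full library of 58 microproteins.
hit_rate_percent = 100 * len(hits) / 58

print(hits, round(hit_rate_percent, 1))
```

Requiring the threshold in both replicates excludes candidates such as MPTA-32, which approaches the cutoff in one experiment but not the other.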
Fig. 3 (A) Sequence schematic showing different lengths of MPTA-17. (B) HEK293T cells were co-transfected with the respective microprotein variant and the firefly reporter vector, and the transactivation activity was assessed using the DualGlo luciferase assay. The microprotein activities from a representative experiment were plotted as fold change to baseline GAL4-activity (mean ± SD, n = 4), one-way ANOVA with multiple comparisons, ****p < 0.0001. Each experiment was conducted 3 times and the activity data with respective controls are presented in Table S6 (ESI†). (C) and (D) HEK293T cells were transfected individually with GAL4 or MPTA-17 variants and western blot was performed with GAL4-antibody and β-actin-antibody to detect protein expression; mock transfection without plasmid was used as a control. (C) A representative blot is shown. (D) The relative protein band intensities for MPTA-variants in respective lanes were normalized to β-actin and quantified using ImageJ, presented as mean ± SD. Each experiment was conducted 3 times and data are presented in Table S6 (ESI†).
We tested whether the different sequence truncations of the microproteins impacted their overall expression or stability post transfection. To test the protein expression levels of the microprotein variants, western blotting was performed using the GAL4-antibody following transfection in HEK293T cells. The N-terminal analog MPTA-17 (1–23) was present at lower levels compared to the native MPTA-17 (Fig. 3C and D and Table S6, ESI†). This variant was inactive in the transactivation assay (Fig. 3B and Table S6, ESI†), and due to its relatively lower protein level it is unclear whether this peptide region contributes to the transactivation activity of MPTA-17. The C-terminal analogs MPTA-17 (16–37) and MPTA-17 (24–37) were expressed at similar but modestly lower levels than the native protein (Fig. 3D). Importantly, the 14-aa C-terminal sequence MPTA-17 (24–37) effectively recapitulates the transactivation function of the native protein and is well-expressed; it is therefore likely the key sequence region that supports the transactivation activity of MPTA-17.
Fig. 4 (A) Sequence schematic showing different lengths of MPTA-45. (B) HEK293T cells were co-transfected with the respective microprotein variant and the firefly reporter vector, and the transactivation activity was assessed using the DualGlo luciferase assay. The microprotein activities from a representative experiment were plotted as fold change to baseline GAL4-activity (mean ± SD, n = 4), one-way ANOVA with multiple comparisons, ****p < 0.0001, ***p = 0.0001. Each experiment was conducted 2 times and the activity data with respective controls are presented in Table S7 (ESI†). (C) and (D) HEK293T cells were transfected with GAL4 or MPTA-45 variants and western blot was performed with GAL4-antibody and β-actin-antibody to detect protein expression; mock transfection without plasmid was done as a control. (C) A representative blot is shown. (D) The relative protein band intensities for MPTA-variants in respective lanes were normalized to β-actin and quantified using ImageJ, presented as mean ± SD. Each experiment was conducted 3 times and data are presented in Table S7 (ESI†).
The protein expression test for selected MPTA-45 analogs was performed by western blotting. The MPTA-45 (1–75) analog showed lower expression compared to the native protein (Fig. 4C and D and Table S7, ESI†). MPTA-45 (68–98) was expressed at a higher level, while MPTA-45 (68–91) and MPTA-45 (76–98) showed levels comparable to the native protein (Fig. 4C and D, Fig. S1A and B and Table S7, ESI†). A protein of lower molecular weight than expected was detected for MPTA-45; however, the quantification and comparison of protein expression among the different analogs was based on the observed band of the correct molecular weight. In this set of variants, MPTA-45 (68–91) was the shortest peptide with transactivation activity that was well-expressed relative to the native MPTA-45, and it contained the alanine stretch.
In this study, we designed a transactivation screen to find functional microprotein sequences by a motif-selection approach. Based on the significant enrichment of alanines observed in human transcription factors and nuclear proteins, we formulated our hypothesis to select candidate microproteins with a 6-ala motif and search for bioactive sequences. Using the GAL4-transactivation assay, we identified three microproteins with transactivating activity, MPTA-10, MPTA-17 and MPTA-45, from a pilot library of 58 microproteins (5% hits). These sequences show short hydrophobic clusters, one or more tryptophan residues and a few polar residues (Fig. S2, ESI†), which align with the expected sequence features of transactivating peptides.33–35 Prior transactivation screens noted that no specific consensus motif is deterministic of the transactivation function, though some features such as the presence of hydrophobic residues in short clusters interspersed with polar or charged residues are more common.33–35 These screens agree with the idea that transactivation peptides within transcription factors typically reside in IDRs and are short, <30 amino acids.33–35
A random DNA library encoding 67 000+ peptides of ≤20 aa length was tested as a fusion to the yeast heat shock factor-1 to identify transactivating sequences, and yielded 1% positive hits.34 Additional high-throughput studies conducted to identify and characterize the sequence properties of transactivating peptides have noted positive hit rates ranging from 0.1% to 4%.33,40 A direct comparison between the hit rates of large-scale studies with randomized peptide sequences and the proof-of-concept screen presented here is not ideal; however, the 5% hit rate obtained here indicates that a polyalanine motif-selection strategy to discover transactivation peptides from microprotein sequences performs no worse than such endeavors.
Polyalanine domains in transcription factors have been defined in the context of at least nine human congenital diseases,24,25,41 where the expansion of the polyalanine tract leads to loss-of-function or abnormal function of the encoded protein. HOXD13 is one example where the expansion of the native polyalanine tract by an additional 7–14 alanines results in synpolydactyly syndrome.42,43 The elongated alanine stretch in the disease variant affects the biomolecular co-condensate formation of HOXD13 with the transcriptional coactivator Mediator.44,45 Polyalanine sequences can influence the biophysical properties of proteins including their secondary structure, protein aggregation, phase separation and cellular localization.24,46,47 Alanine-rich sequences were initially recognized in insect transcriptional repressors such as Krüppel48 and Hox,49 and FEV50 is one example of a human repressor protein. The molecular rationale for the high prevalence of polyalanine stretches in transcription regulatory proteins across eukaryotic proteomes remains to be fully defined, and no generalized molecular mechanism of their specific function has been proposed.24,25,27,28 However, there is sufficient evidence of enrichment of alanine repeats in nuclear as well as transcription regulatory proteins to warrant continued studies on this subject.
Structure–activity testing for MPTA-17 showed that MPTA-17 (24–37) did not require an alanine stretch for transactivation (Fig. 3B), while the active MPTA-45 (68–91) variant contained the alanine stretch (Fig. 4B). Further work will be required to understand how alanine-rich motifs may contribute to the transactivation function directly or indirectly, by affecting the biophysical characteristics or subcellular localization of microproteins in the cellular context. We note that some microprotein sequences or variants may be more prone to proteolytic degradation, or otherwise have variable expression profiles when transfected in cellular assays, which could lead to false negatives during preliminary screening. Protein sequences terminating with C-terminal alanine repeats have been demonstrated to target proteins for proteolytic degradation by E3 ligases,51,52 and it is possible that MPTA-17 (1–23) and MPTA-45 (1–75), which have C-terminal alanine residues, are more prone to degradation than the respective native sequences (Fig. 3C, D and 4C, D). Additional studies will be required to ascertain whether the microproteins identified in the synthetic GAL4-transactivation assay could be endogenous regulators of transcription.
IDRs are known to possess SLiMs of 3–10 residues that facilitate their functionality.17,53 The annotation of functional SLiMs in the eukaryotic linear motif (ELM) database has been rising steadily.53 It is proposed that such sequences can rapidly evolve by the use of motifs or specific sequence features instead of absolute sequence homology at the level of amino acids.17 Microproteins are thought to represent products of de novo gene creation, are less conserved than canonical genes54,55 and also possess features similar to IDRs. A high-throughput protein interaction screen on a peptide matrix was conducted for peptide sequences derived from microproteins to identify their protein interaction partners, and led to the identification of dozens of microprotein interactions that are likely to be governed by SLiMs.55 We anticipate that, with the growing availability of microprotein databases and illustrations of how SLiMs in microproteins can drive molecular functions, the motif-guided screening approach presented here can be expanded to design broader test libraries to explore the functional space of microproteins.
000 cells per well. The next day, the cells were co-transfected with a test GAL4-MPTA plasmid, or the respective control plasmid (GAL4, GAL4-ERα or p65-TA1), and the firefly luciferase reporter plasmid using Lipofectamine LTX reagent with PLUS reagent (Invitrogen #15338100) in Opti-MEM I reduced serum medium (Gibco #31985062). Each treatment was done in quadruplicate. 24 hours post-treatment, the selected wells treated with GAL4-ERα as the positive control were supplemented with 30 nM estradiol (Sigma-Aldrich #E1024, CAS No. 50-28-2). 48 hours post-transfection, the Dual-Glo® luciferase assay (Promega #E2940) was performed as described. The medium was aspirated, 75 μL per well PBS and 75 μL DualGlo reagent were added, and the plate was incubated for 15 minutes at room temperature on a shaker. The cell lysate solution (120 μL per well) was transferred to a 96-well black plate, and firefly luminescence was recorded with a BioTek Synergy plate reader. The StopGlo reagent (60 μL per well) was added to the plate and incubated for 15 minutes at room temperature on the shaker, and Renilla luminescence was recorded.
000× g for 25 minutes at 4 °C, and the supernatant was collected. The total protein concentration of each lysate sample was determined using Pierce Microplate BCA Protein Assay Kit – reducing agent compatible (Thermo Scientific #23252). Samples were prepared for gel-loading at equal protein concentration (22 μg per lane) with Bolt LDS sample buffer (Invitrogen #B0008) and heated for 10 minutes at 70 °C. A Bolt 4–12% Bis-Tris Plus WedgeWell (Invitrogen #NW04120BOX) gel was run with 1× Bolt MES SDS buffer (Invitrogen #B0002) with the samples and a protein ladder (Biorad #1610377). The protein was transferred to a PVDF membrane (Invitrogen #IB24002) with the iBlot2 system. The membrane was blocked using Intercept blocking buffer TBS (Licor #927-60001) and incubated with GAL4(DBD) antibody (SantaCruz #sc-510 RK5C1) overnight at 4 °C. The membrane was washed three times with TBST (with 0.1% Tween-20), followed by Goat anti-Mouse IgG (IRDye 800CW, Licor #926-32210) incubation for 2 hours at room temperature, and washed 5 times with TBST. The membrane was imaged using an Odyssey CLx scanner. The same membrane was incubated with β-actin rabbit monoclonal antibody (Licor #926-42210) followed by Goat anti-Rabbit IgG (Alexa Fluor 680, Invitrogen #A21109) and imaged. Each plasmid transfection was done three times, and three corresponding western blot experiments were performed unless noted otherwise. The MPTA-45 (76–98) plasmid transfection and respective western blot were performed once.
Gene Ontology analysis for the polyalanine protein subset was done by the statistical overrepresentation test with Panther Protein Class and GO cellular component complete (https://www.pantherdb.org/).
For the DualGlo luciferase assay performed in 96-well plates, the firefly luminescence signal was divided by the Renilla luminescence for each well to obtain the transactivation activity ratio, which was then normalized to the baseline GAL4 activity. Data were analyzed in this manner consistently across all experiments and the raw data from one experiment are shown in Table S8 (ESI†). The data are presented as mean ± SD (n = 4) for each plasmid treatment, and the experiment was performed two or three times as noted. One-way ANOVA with multiple comparisons (Dunnett's) statistical test was performed using GraphPad Prism 10. All the graphs were plotted using GraphPad Prism 10.
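The normalization described above can be written out as a short Python sketch. The luminescence counts are hypothetical, the function names are illustrative, and two details are assumptions not stated in the text: the baseline is taken as the mean ratio of the GAL4-only wells, and the SD is the sample standard deviation across the four wells.

```python
from statistics import mean, stdev

def activity_ratios(firefly, renilla):
    """Per-well transactivation ratio: firefly signal divided by Renilla signal."""
    return [f / r for f, r in zip(firefly, renilla)]

def fold_over_gal4(sample_ratios, gal4_ratios):
    """Normalize each sample well to the mean GAL4 baseline ratio.

    Returns (mean fold change, sample SD) across the wells.
    """
    baseline = mean(gal4_ratios)
    folds = [ratio / baseline for ratio in sample_ratios]
    return mean(folds), stdev(folds)

# Hypothetical quadruplicate luminescence counts (not data from Table S8).
gal4 = activity_ratios([100, 110, 90, 100], [1000, 1000, 1000, 1000])
test_construct = activity_ratios([500, 520, 480, 500], [1000, 1000, 1000, 1000])

fold_mean, fold_sd = fold_over_gal4(test_construct, gal4)
print(f"{fold_mean:.2f} ± {fold_sd:.2f}")
```

Dividing by the Renilla signal corrects each well for transfection efficiency before any comparison to the GAL4 baseline is made.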
ImageJ was used to analyze the relative band intensities for western blots. Rectangular boxes were drawn around the desired bands, and intensity peak plots were obtained with the Analyze Gels option. The ratio of GAL4 or GAL4-MPTA band intensity to the respective β-actin band in individual lanes was calculated. The obtained ratio was normalized to GAL4 to represent the relative band intensity of each tested GAL4-MPTA construct, and data are presented as mean ± SD from three experiments unless noted otherwise.
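The two-step normalization of band intensities can be sketched as follows. The intensity values are hypothetical and the function name is illustrative; the sketch assumes that each blot contributes one construct lane and one GAL4 lane, each with its own β-actin band, and that the SD is the sample standard deviation across blots.

```python
from statistics import mean, stdev

def relative_band_intensity(construct_band, construct_actin, gal4_band, gal4_actin):
    """Actin-normalized construct intensity divided by actin-normalized GAL4 intensity."""
    return (construct_band / construct_actin) / (gal4_band / gal4_actin)

# Hypothetical ImageJ peak areas from three independent blots:
# (construct band, construct β-actin, GAL4 band, GAL4 β-actin)
blots = [
    (400, 1000, 800, 1000),
    (450, 900, 1000, 1000),
    (300, 1000, 500, 1000),
]

relative = [relative_band_intensity(*blot) for blot in blots]
rel_mean, rel_sd = mean(relative), stdev(relative)
print(f"{rel_mean:.2f} ± {rel_sd:.2f}")
```

Normalizing within each blot before averaging keeps differences in loading and imaging between blots from inflating the reported variation.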
Table of contents – Created in BioRender. Agrawal, A. (2025) https://BioRender.com/m76s319, Fig. 2 – Created in BioRender. Agrawal, A. (2025) https://BioRender.com/i24z440.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4cb00277f
This journal is © The Royal Society of Chemistry 2025