Open Access Article
Ewerton Cristhian Lima de Oliveira a, Juliana Auzier b, Gabriel Pereira Coelho c, Lidiane Diniz do Nascimento c, Anderson Henrique Lima e Lima d, Caio Marcos Flexa Rodrigues c, Anton De Spiegeleer fg, Evelien Wynendaele eg, Claudomiro Sales b, Bart De Spiegeleer *eg and Kauê Santana *c
aInstituto Tecnológico Vale, 66055-090 Belém, Pará, Brasil
bLaboratório de Inteligência Computacional e Pesquisa Operacional, Instituto de Tecnologia, Universidade Federal do Pará, Campos Belém, 66075-110 Belém, Pará, Brasil
cLaboratório de Simulação Computacional, Instituto de Biodiversidade, Universidade Federal do Oeste do Pará, Campus Santarém, Santarém, Pará, 68.040-070, Brasil. E-mail: kaue.costa@ufopa.edu.br
dLaboratório de Planejamento e Desenvolvimento de Fármacos, Instituto de Ciências Exatas e Naturais, Universidade Federal do Pará, Belém, Pará 66075-110, Brasil
eDrug Quality and Registration (DruQuaR) group, Faculty of Pharmaceutical Sciences, Ghent University, Ottergemsesteenweg 460, B-9000 Ghent, Belgium. E-mail: Bart.DeSpiegeleer@ugent.be
fDepartment of Geriatrics, Faculty of Medicine and Health Sciences, Ghent University Hospital, Ghent, Belgium
gTranslational Research in Immunosenescence, Gerontology and Geriatrics (TRIGG) group, Ghent University Hospital, Ghent, Belgium
First published on 6th April 2026
Peptides, short chains of amino acids linked by peptide bonds, typically ranging from 2 to 50 residues, are fundamental to diverse biological processes and represent a valuable source for the development of novel bioactive compounds. In this work, we provide a comprehensive and conceptual overview of approaches to exploring the peptide chemical space. We emphasize intrinsic challenges in their chemical space investigation, particularly the complex interplay among peptide conformation, bioactivity, and bioavailability, as well as the role of sequence- and structure-derived molecular features in elucidating structure–activity relationships. Furthermore, we examine computational strategies, such as dimensionality reduction techniques, machine learning models, and similarity-based complex networks for classifying and characterizing this chemical space. Finally, we underscore the importance of interdisciplinary frameworks in advancing peptide research, highlighting how integrative approaches can uncover intersections of bioactivity across different peptide classes and leverage alternative chemical spaces to optimize and characterize peptide structures.
Professor Kauê Santana da Costa is an Adjunct Professor at the Federal University of Western Pará (UFOPA) (H-index: 16, ORCID: https://orcid.org/0000-0002-2735-8016, Web of Science: https://www.webofscience.com/wos/author/record/1851974), where he coordinates the Laboratory of Computational Simulation & Scientific Education and leads the Interdisciplinary Group for the Application and Development of Biomolecular Technologies. His research intersects theoretical and computational chemistry, bioinformatics, and artificial intelligence, with a particular focus on self-assembling, quorum-sensing activity, and membrane-penetrating peptides for nanocarrier and biotechnological applications. Over the last years, Professor Costa has become a reference in the development of machine-learning models for peptide bioactivity, especially cell-penetrating and blood–brain–barrier-penetrating peptides. He is a co-author of the Scientific Reports article “Predicting cell-penetrating peptides using machine learning algorithms and navigating in their chemical space” (2021), which introduced supervised models to classify CPPs and explore their chemical space. Building on this line, he contributed to the comprehensive review “Biological Membrane-Penetrating Peptides: Computational Prediction and Applications” in Frontiers in Cellular and Infection Microbiology (2022), and to “BrainPepPass: A Framework Based on Supervised Dimensionality Reduction for Predicting Blood–Brain Barrier-Penetrating Peptides” in Journal of Chemical Information and Modeling (2023), which couples supervised dimensionality reduction with peptide classification. 
Most recently, he co-authored the deep-learning study “Investigating molecular descriptors in cell-penetrating peptides prediction with deep learning: Employing N, O, and hydrophobicity according to the Eisenberg scale” in PLOS One (2024), refining descriptor selection for peptide prediction, and the ACS Omega article “Beyond Molecular Weight: Peptide Characteristics Influencing the Sensitivity of Retention to Changes in Organic Solvent in Reversed-Phase Chromatography” (2025), which deepens the understanding of peptide physicochemical behavior. He also coordinates PepSpace, an international collaborative project that aims to map and organize the chemical space of peptides. The web server is under development and integrates supervised and unsupervised learning to analyze the chemical space and bioactivity of quorum-sensing, cell-penetrating, and blood–brain barrier-penetrating peptides, consolidating his role at the interface of peptide science and machine learning.
Determining the chemical space of molecules aims to classify compounds, identify potential bioactive molecules, design and improve lead candidates, and understand their molecular properties.5,6 Independent of the method, mapping the chemical space can significantly enhance the efficiency of novel discoveries across various areas of chemistry, including chemical synthesis,7–9 quantum chemistry,10 materials science,11–13 and drug discovery.5,14,15
The chemical space of compounds can be explored by analyzing libraries of organic molecules using two-dimensional (2D) or three-dimensional (3D) visual representations of multidimensional descriptor spaces plotted in Cartesian coordinates, which often require dimensionality-reduction or clustering techniques.16,17
Fig. 1 illustrates the application of a computational workflow to map the chemical space of peptides. Chemical space mapping integrates heterogeneous peptide inputs, either sequence-based representations (primary structure) or 3D structural models (tertiary structure), to derive informative molecular descriptors. The visual exploration of peptide chemical space can be performed using computational tools that efficiently provide a visual correlation between molecules exhibiting similar chemical and/or functional properties.18–20 The most straightforward approach to this task is a pipeline that integrates: (i) the calculation of molecular descriptors using chemical packages, such as RDKit,21 iFeature,22,23 and Mordred;24 (ii) the application of feature-correlation tests; (iii) clustering methods, such as k-means and density-based spatial clustering of applications with noise (DBSCAN); (iv) dimensionality reduction techniques, including t-distributed stochastic neighbor embedding (t-SNE)25 and uniform manifold approximation and projection (UMAP);26 and (v) graphical representations for the 2D and 3D visualization of the numerical representations generated from these projections.18,20
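The first two pipeline stages (descriptor calculation and a feature-correlation test) can be sketched in plain Python. This is a minimal illustration, not the workflow of Fig. 1: amino acid composition stands in for the richer descriptor sets produced by packages such as RDKit, iFeature, or Mordred, and the function names, the example sequences, and the 0.95 correlation cutoff are assumptions made here.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(sequence):
    """Amino acid composition: fraction of each of the 20 standard residues."""
    seq = sequence.upper()
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

def pearson(x, y):
    """Pearson correlation between two descriptor columns (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def drop_correlated(matrix, threshold=0.95):
    """Greedy filter: skip constant columns, then keep a column only if its
    absolute correlation with every previously kept column is <= threshold.
    The threshold is an illustrative assumption."""
    columns = list(zip(*matrix))
    kept = []
    for j, col in enumerate(columns):
        if max(col) == min(col):
            continue  # constant descriptors carry no information
        if all(abs(pearson(col, columns[k])) <= threshold for k in kept):
            kept.append(j)
    return kept

# Illustrative input sequences (assumed for this sketch only)
peptides = ["GRKKRRQRRRPPQ", "RQIKIWFQNRRMKWKK", "KLALKLALKALKAALKLA", "AAAAGGGG"]
X = [aac(p) for p in peptides]
kept_columns = drop_correlated(X)
X_filtered = [[row[j] for j in kept_columns] for row in X]
```

The filtered matrix `X_filtered` is what would be passed to the clustering and dimensionality-reduction stages.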
Several open-source computational tools have been developed to facilitate the exploration of molecular chemical space based on molecular descriptor calculations. ChemPlot is a Python-based tool designed for both static and interactive visualization of chemical space in molecular datasets, encompassing dimensionality reduction techniques and similarity computations.27 TMAP is a tool developed for Python and employed to visualize high-dimensional chemical space by organizing molecules into a minimum spanning tree structure through molecular fingerprint comparisons.28 Similarly, KNIME, an open-source software developed for data science and visual analytics, in which computational workflows are structured as flowchart-based pipelines, has dedicated extensions to support chemical space visualization and other cheminformatics applications.29
Fig. 2 illustrates an example of how the chemical space of blood–brain barrier penetrating peptides (B3PPs) and quorum-sensing peptides (QSPs) can be represented in a 2D chart with the results of the dimensionality reduction provided by PCA and the clustering of the peptides predicted by the k-means algorithm.
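A representation of this kind can be reproduced in outline with a PCA projection followed by k-means clustering. The sketch below uses only the Python standard library; the power-iteration PCA, the deterministic seeding, and the toy descriptor matrix are assumptions for illustration and do not reproduce the data behind Fig. 2.

```python
import random
from math import sqrt

def pca_2d(X, iters=500, seed=0):
    """Project rows of X onto the two leading principal components,
    found by power iteration with deflation on the covariance matrix."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    C = [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(d)] for i in range(d)]
    rng = random.Random(seed)
    components = []
    for _component in range(2):
        v = [rng.uniform(-1, 1) for _ in range(d)]
        start_norm = sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _step in range(iters):
            w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # no variance left in the deflated matrix
            v = [x / norm for x in w]
        lam = sum(v[i] * C[i][j] * v[j] for i in range(d) for j in range(d))
        components.append(v)
        # Deflation: remove the component just found
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return [[sum(r[j] * v[j] for j in range(d)) for v in components] for r in Xc]

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm with centroids initialised from randomly sampled points."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _step in range(iters):
        new_labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Toy descriptor matrix (hypothetical values): two well-separated groups
X = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.1], [0.0, 0.1, 0.0],
     [5.0, 5.0, 5.0], [5.1, 5.0, 4.9], [5.0, 4.9, 5.1]]
coords = pca_2d(X)
labels = kmeans(coords, k=2)
```

Plotting `coords` colored by `labels` yields the kind of 2D chart described above; real descriptor matrices would normally be standardized before the projection.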
In contrast to coordinate-based representations, chemical space networks (CSNs) have been introduced for chemical space analyses, allowing the exploration of molecular properties without reducing dimensionality.30–33 Various similarity-based complex networks, including half-space proximal networks (HSPNs), metadata networks, and CSNs, have been utilized to study the bioactivity of compounds and their associated chemical space. Fig. 3 represents an overview of the applications of these methods in peptide science.
The similarity-based complex networks are graphical representations of the chemical space of peptides, where nodes represent the peptides and the edges between two nodes denote their pairwise similarity or dissimilarity relationships in the space.34 The distance between the compounds is often measured using similarity (or dissimilarity) metrics, such as Euclidean and Manhattan distances and Tanimoto and Soergel coefficients. In these networks, the relevance of the elements is investigated using centrality measures (betweenness, closeness, and edge betweenness), as well as global network properties and their corresponding global measures, such as modularity, connectivity, density, and size.35,36 For example, the StarPep Toolbox is a platform to explore the chemical space of antimicrobial peptides (AMPs) through molecular network-based representations and similarity-search methods to support peptide drug repurposing, as well as the development and optimization of novel sequences.37 Recently, antiviral peptides (AVPs) were mapped into a chemical space using HSPNs and contextualized with metadata networks using the StarPep Toolbox. The analyses revealed eight chemically distinct, biologically coherent AVP communities without relying on fixed similarity thresholds and linked them to origins, functions, and viral targets through metadata networks. The authors performed a centrality-guided scaffold extraction, which revealed four non-redundant subsets suitable for modeling and multi-query searches. The mapping of motifs against non-AVP datasets indicated that motif burden correlates with higher predicted AVP probabilities, with peptides carrying four to five motifs achieving the highest scores across independent predictors, suggesting that motif-driven design is an interesting strategy to expand the AVP chemical space.33
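A threshold-based similarity network can be sketched in a few lines. The example below is a simplified stand-in, not the StarPep or HSPN procedure: peptides are encoded as sets of dipeptides, edges are drawn when the Tanimoto coefficient reaches an assumed cutoff of 0.3, and degree centrality replaces the betweenness and closeness measures mentioned above for brevity. All names and sequences are hypothetical.

```python
def kmers(seq, k=2):
    """Set of overlapping k-mers (dipeptides by default) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def build_network(peptides, threshold=0.3):
    """Connect two peptides when their similarity reaches the (assumed) threshold."""
    feats = {p: kmers(p) for p in peptides}
    names = list(peptides)
    edges = set()
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            if tanimoto(feats[p], feats[q]) >= threshold:
                edges.add((p, q))
    return edges

def degree_centrality(nodes, edges):
    """Fraction of the other nodes each node is connected to."""
    degree = {p: 0 for p in nodes}
    for p, q in edges:
        degree[p] += 1
        degree[q] += 1
    return {p: d / (len(nodes) - 1) for p, d in degree.items()}

def density(nodes, edges):
    """Observed edges divided by the number of possible edges."""
    n = len(nodes)
    return 2 * len(edges) / (n * (n - 1))

peptides = ["KLAKLAK", "KLAKLAR", "GGSGGS"]  # hypothetical sequences
edges = build_network(peptides)
centrality = degree_centrality(peptides, edges)
network_density = density(peptides, edges)
```

Here the two lysine-rich sequences share three of four dipeptides and are linked, while the glycine-serine sequence remains an isolated node.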
Recently, similarity-based complex networks and machine-learning algorithms have been used to map the landscape of some classes of peptides.38–40 Half-space proximal networks, metadata networks, and chemical space networks are examples of computational methods that leverage graph theory to analyze and explore relationships among chemical entities based on a given molecular property.41,42 These methods aim to simplify and analyze the vast complexity of chemical data, associating this information with the desired properties. For example, Ayala-Ruano et al. (2022) used network analyses and similarity-guided screening to investigate the chemical space of antiparasitic peptides, emphasizing the challenge of discovering new therapeutic peptides from this vast chemical space. The authors combined HSPNs, CSNs, and metadata networks to identify central peptides and to perform multi-query similarity searches against the StarPepDB database. Although the model reported strong performance (Matthews Correlation Coefficient, MCC, values ranging from 0.834 to 0.965), challenges remained, especially regarding the high sequence diversity of peptides, the need for effective toxicity filtering, and the reliance on computational methods that may not fully capture the complexity of peptide interactions and biological functions.38 Similarly, a study conducted by Castillo-Mendieta et al. (2024) used chemical space complex networks to map the chemical space of hemolytic peptides and to enhance the design of safe peptide-based therapeutics. By analyzing a database of 2004 hemolytic peptides, the authors identified 12 consensus hemolytic motifs. They developed multi-query similarity searching models that outperformed the existing machine learning models in predicting hemolytic activity.39
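For reference, the Matthews correlation coefficient quoted above is computed from the four cells of a binary confusion matrix. The sketch below shows the standard formula; the confusion-matrix counts are hypothetical and are not taken from the cited studies.

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.
    Returns 0.0 when any marginal total is zero (undefined case)."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator

# Hypothetical counts for a peptide classifier
example_score = mcc(tp=90, tn=85, fp=15, fn=10)
```

Values range from -1 (complete disagreement) through 0 (no better than chance) to +1 (perfect prediction), which makes the metric informative for the imbalanced datasets common in peptide classification.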
In another study, Wang et al. (2024) improved a computational framework using a reinforcement learning (RL)-driven generative model integrated with graph attention mechanisms, which captured the connectivity structure between amino acid residues in peptides and used it to guide the search for optimal peptide sequences. The algorithm incorporates bioactivity and ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties, ensuring that the generated peptides meet drug-like criteria.43
In contrast to the consensus chemical space, the concept of the chemical multiverse was introduced as a collection of multiple chemical spaces, each defined by different descriptors. This concept emphasizes that there is no single chemical space; rather, various representations of the same set of molecules can yield distinct chemical spaces.44 The chemical multiverse allows for a more comprehensive analysis of compound datasets using multiple descriptors, which can capture different aspects of molecular structures and properties. This approach contrasts with the idea of a consensus chemical space widely applied in medicinal chemistry, which attempts to combine various descriptors into a single representation, potentially losing valuable information in the process. The chemical multiverse does not rely on multiple descriptor combinations; instead, it consists of various alternative graphical representations of the chemical space, incorporating different molecular properties derived from the investigated compounds, such as molecular fingerprints, structure-based or sequence-based descriptors.44
The description of the chemical space of peptides faces significant challenges, partly due to their chemical structure, their broad range of physicochemical properties, and their intrinsic polymer-like features related to the repetition of the amide backbone, which tends to mask the signals relevant to their bioactive properties and hinders the analysis of these properties in the chemical space.45,46 In addition, it has been demonstrated that some unrelated classes of peptides show unexplained intersections between their bioactivities,47,48 indicating that chemical spaces previously regarded as distinct share molecular similarities, which must be better explored to find possible intersections in the intervals of the molecular descriptors applied to characterize them.
The chemical space usually contains some distinct clusters named ‘constellations’, which are populated by molecules with specific properties that can be identified using scaffold-based analysis due to the presence of a common structural core.49,50 The Murcko framework has been widely applied to investigate the structural core of drugs, revealing structural information and distinguishing molecules by their ring systems, linkers, and side chain atoms.51 However, Murcko frameworks can only represent molecules containing ring systems; therefore, acyclic (linear) peptides are usually omitted from these analyses.50 Moreover, peptides are characterized by their large molecular size and shape, as well as by the presence of polar groups, which usually put them beyond the conventional predictors of drug-likeness for molecules52,53 and impose more complexity in evaluating their conformational changes and pharmacophore properties.54
Scrambled peptides share the amino acid composition (AAC) and sequence length of their parent sequences; however, they can adopt different conformations, which confer different biological activities.55,56 The study of scrambled peptides has thus demonstrated some of the peculiarities that distinguish peptides from small molecules.
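A short sketch makes the descriptor-level consequence concrete: a composition-only descriptor cannot distinguish a peptide from its scrambled variant, whereas an order-sensitive descriptor such as dipeptide counts can. The helper names and the four-residue sequences are hypothetical.

```python
import random
from collections import Counter

def composition(seq):
    """Residue counts: identical for a peptide and any scrambled variant."""
    return Counter(seq)

def dipeptide_counts(seq):
    """Counts of adjacent residue pairs: sensitive to residue order."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))

def scramble(seq, seed=0):
    """Return a random permutation of the residues (seeded for reproducibility)."""
    residues = list(seq)
    random.Random(seed).shuffle(residues)
    return "".join(residues)

parent, variant = "KLAK", "AKKL"  # hypothetical parent and scrambled sequences
same_composition = composition(parent) == composition(variant)
same_dipeptides = dipeptide_counts(parent) == dipeptide_counts(variant)
```

In this example `same_composition` is true and `same_dipeptides` is false, which is why composition-based descriptors alone place both sequences at the same point in chemical space.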
The number of chemically modified residues arising from post-translational modifications is likely to be extensive. The peptide chemical space thus encompasses a wide array of changes that can significantly alter the properties and functions of peptides.61 Understanding the chemical space of peptides also involves understanding the three-dimensional conformations adopted by these molecules in the solvation medium, as their geometries are closely related to the mechanisms of action associated with membrane permeability and stereoselectivity towards the molecular target.60,62,63 The conformation of a peptide refers to the spatial arrangement of its atoms that results from rotations around single bonds over time. It intrinsically depends on the peptide sequence and the external environment, and it is related to secondary structure propensity. Peptides dynamically adopt a collection of conformations distributed across a free energy landscape, with their occurrence governed by Boltzmann-weighted probabilities.64
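The Boltzmann weighting of a conformational ensemble can be illustrated as follows. This is a minimal sketch assuming relative conformer energies in kcal/mol and a temperature of 298.15 K; the energy and property values are hypothetical.

```python
from math import exp

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def boltzmann_populations(energies_kcal, temperature=298.15):
    """Population of each conformer: p_i = exp(-E_i / kT) / sum_j exp(-E_j / kT).
    Energies are shifted by the minimum for numerical stability."""
    kt = K_B * temperature
    e_min = min(energies_kcal)
    weights = [exp(-(e - e_min) / kt) for e in energies_kcal]
    partition = sum(weights)
    return [w / partition for w in weights]

def ensemble_average(values, populations):
    """Population-weighted average of a conformer-dependent property."""
    return sum(v * p for v, p in zip(values, populations))

energies = [0.0, 0.5, 2.0]         # hypothetical relative energies, kcal/mol
polar_areas = [105.0, 150.0, 210.0]  # hypothetical per-conformer 3D PSA, in square angstroms
populations = boltzmann_populations(energies)
average_area = ensemble_average(polar_areas, populations)
```

Any conformation-dependent descriptor can be averaged in the same way, so that the value entering the chemical space reflects the ensemble rather than a single structure.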
Some conformational adaptability in peptides can also modulate their bioactivity, because changes in the conformational ensemble alter the positions of pharmacophoric side chains and the population of binding-competent states, thereby affecting potency, selectivity, and the recognition to the target.65–67 This is consistent with observations that scrambled variants of peptides with similar composition may adopt distinct conformations and exhibit different biological activities.56 As the secondary structure is a determinant of peptide bioactivity, some strategies have been developed to impose constraints on their conformation to control their bioactivity.68,69 For example, some peptide design strategies for AMPs include the incorporation of restrictors, such as lactam and disulfide bridges, which act as conformational inducers (promoting β-like structures) and enhance resistance to protease degradation.70,71
Two classes of peptides that naturally cross biological barriers illustrate the relevance of conformational changes: the cell-penetrating peptides (CPPs) and blood–brain barrier-penetrating (B3PPs).72,73 Both classes exhibit a conformational characteristic that influences their ability to cross these barriers, which is called chameleonic properties. This conformational property refers to their ability to change conformation in response to environmental conditions, particularly to expose or hide polar groups when crossing biological membranes.74,75 Some well-reported examples of chameleonic properties of peptides include cyclosporin A75 and some of its derivatives76 (Fig. 4, panel A), as well as some cyclic peptides.75,77 This property has significant implications for their chemical space and the overall functionality of biomembrane-penetrating peptides. By altering their conformation, these peptides can effectively navigate through the hydrophobic core of cell membranes, improving their bioavailability.77
Fig. 4 Chameleonicity and backbone N-methylation as determinants of passive membrane permeability in macrocyclic peptides. Panel (A) Cyclosporin A, a chameleonic molecule, shown in representative open (greater polar surface exposed) and closed (polar surface partially buried) conformations, with electrostatic potential maps (scale in kBT/e) that illustrate environment-dependent exposure/burial of polar groups through conformational switching and intramolecular hydrogen bonding. Center: Conceptual schematic of passive diffusion through a lipid bilayer: lower-permeability variants tend to retain solvent-exposed hydrogen-bond donors and acceptors, whereas higher-permeability variants more effectively mask polarity. Panel (B) Parent scaffold cyclo[Leu1, D-Leu2, Leu3, Leu4, D-Pro5, Tyr6] (compound 1) and the corresponding trimethylated analogue (compound 3), bearing backbone N-methyl groups at D-Leu2, Leu3, and Tyr6 (Me), a modification pattern associated with higher permeability, as reported by White et al. (2011).93
Some computational models may struggle to accurately represent the conformational variability of peptides, posing challenges for predicting their pharmacokinetic properties.58 For example, the calculation of topological polar surface area (tPSA) does not depend on the three-dimensional characteristics of the molecules, and it has been widely applied to correlate with the hydrogen bond pattern of molecules in the aqueous phase.78 This property has been associated with the prediction models of solubility and passive diffusion through cell membranes.79–82 Elevated tPSA values are associated with complexation with water molecules and increased molecular volume, which can hinder membrane permeability.83 Typically, the penetration of compounds across cell membranes is restricted when tPSA exceeds 140 Å².84 However, higher values are generally acceptable for macrocyclic peptides (tPSA = 220 Å²) and peptides exhibiting chameleonic properties (tPSA = 280 Å²).81,85 The Molecular 3D PSA (MPSA) has emerged as a more accurate measure of compound solubility and membrane permeability than the tPSA, as it considers the three-dimensional conformation of the compound in a given environment.86 The tPSA, although independent of the three-dimensional structure, can still reach satisfactory predictions, especially when combined with the molecular weight of compounds.87 The macrocyclic peptide cyclosporin A is an example of a natural peptide that exhibits chameleonic behavior. Cyclosporin A has a high tPSA value of 279 Å² and an MPSA value of 105 Å², with approximately 62% of its PSA concealed in nonpolar environments.87,88
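The cyclosporin A figures quoted above can be checked with simple arithmetic. The sketch assumes that the concealed fraction is read as 1 - MPSA/tPSA, which is our interpretation of the quoted numbers; the function names are hypothetical, and the 140 Å² limit is the small-molecule threshold cited in the text.

```python
def concealed_psa_fraction(tpsa, mpsa):
    """Fraction of the topological PSA that is hidden in the 3D conformation,
    read here as 1 - MPSA / tPSA (an interpretive assumption)."""
    if tpsa <= 0:
        raise ValueError("tPSA must be positive")
    return 1.0 - mpsa / tpsa

def exceeds_small_molecule_limit(tpsa, limit=140.0):
    """True when tPSA is above the conventional small-molecule permeability limit."""
    return tpsa > limit

# Values quoted in the text for cyclosporin A (square angstroms)
csa_tpsa, csa_mpsa = 279.0, 105.0
csa_fraction = concealed_psa_fraction(csa_tpsa, csa_mpsa)
csa_flagged = exceeds_small_molecule_limit(csa_tpsa)
```

With these inputs the concealed fraction is 174/279, or about 0.62, in line with the figure of roughly 62% given above, while the tPSA alone would flag the molecule as poorly permeable.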
The permeability of peptides across cell membranes can be significantly influenced by their secondary structure.89 Studies have demonstrated the impact of peptide conformation on arginine-mediated internalization in cell membranes.90 Similarly, other studies have shown that, for CPPs, some helices stabilized by hydrocarbon cross-links can effectively enter cells.91,92 The N-methylation of cyclic peptides has been widely reported as an interesting strategy to improve permeability across cell membranes.79,93 White et al. (2011), for example, demonstrated that methylated analogues generated by on-resin N-methylation showed improved membrane permeability. Figure X, panel (B), shows the regioselective backbone N-methylation of a cyclic hexapeptide scaffold cyclo[Leu1, D-Leu2, Leu3, Leu4, D-Pro5, Tyr6] (compound 1) and its trimethylated analogue (compound 3) generated by on-resin N-methylation. In compound 3, D-Leu2, Leu3, and Tyr6 are N-methylated (Me), a pattern associated with markedly improved passive permeability in the study by White et al. (2011).93
Currently, most in silico models, including cheminformatics filters and machine-learning approaches used in peptide science, assume that these molecules cross biological membranes primarily via passive diffusion, implying that membrane penetration occurs mainly through biophysical interactions between the peptide's structure and the membrane.58 Consequently, active transport pathways, such as receptor-mediated transcytosis, active influx transport, and carrier-mediated transcytosis, are often overlooked in the design of these models, primarily due to the intricate binding processes of membrane proteins and the conformational accommodation associated with receptor binding.58,94 Furthermore, some cheminformatics filters that use a set of molecular descriptors and their intervals to characterize drug-like molecules, such as the BOILED-Egg model and Lipinski's Rule of Five, often fail to accurately predict the permeability of peptides due to their distinct chemical space.82,95 These classical rules and empirical models were largely developed and calibrated using small molecules and therefore delineate a region of property space enriched for passive permeability and oral bioavailability.79,95 Peptides typically present higher molecular weight, larger polar surface area, multiple hydrogen-bond donors/acceptors, and higher conformational flexibility, features that shift them outside conventional small-molecule boundaries and explain why these filters also misjudge their overall drug-like behavior.83,95 Therefore, applying small-molecule drug-likeness rules can artificially truncate peptide chemical space and may lead to misleading conclusions about their bioactivity or bioavailability.
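To illustrate how such a filter truncates peptide space, the sketch below applies the four Rule of Five criteria (MW ≤ 500 Da, log P ≤ 5, no more than 5 hydrogen-bond donors, no more than 10 hydrogen-bond acceptors) to precomputed descriptor values. The function name and both descriptor sets are hypothetical and serve only as an example.

```python
def rule_of_five_violations(mw, logp, hbd, hba):
    """Return the list of Rule of Five criteria that a compound violates."""
    checks = {
        "MW > 500": mw > 500,
        "logP > 5": logp > 5,
        "HBD > 5": hbd > 5,
        "HBA > 10": hba > 10,
    }
    return [name for name, violated in checks.items() if violated]

# Hypothetical descriptor values for a small molecule and a linear decapeptide
small_molecule = {"mw": 300.0, "logp": 2.0, "hbd": 2, "hba": 4}
decapeptide = {"mw": 1250.0, "logp": -2.1, "hbd": 18, "hba": 20}

small_molecule_violations = rule_of_five_violations(**small_molecule)
decapeptide_violations = rule_of_five_violations(**decapeptide)
```

The hypothetical decapeptide fails three of the four criteria on size and polarity alone, regardless of any conformational masking, which is the behavior the paragraph above cautions against over-interpreting.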
Moreover, some peptide classes can partially overcome these restrictions through structural adaptations that are not captured by simple 2D descriptors used in the cheminformatic filters. For example, macrocyclic and “chameleonic” peptides can conceal polar surface area via intramolecular hydrogen bonding and environment-dependent conformational changes, thereby improving passive permeability despite high tPSA values. In addition, descriptors that incorporate 3D conformation may be more informative than purely 2D metrics for some peptide subclasses.62,74
Considering these limitations, efforts have been made to accurately represent their geometries by analyzing the rotamers and possible conformational changes of peptides to enhance the prediction of their molecular activities.96,97
An overview of the molecular descriptors applied to peptides is shown in Fig. 5.
Fig. 5 Classes of molecular descriptors commonly used to represent peptide sequences and structures and to analyze peptide structure–activity relationships.
Many descriptors widely used in peptide science and medicinal chemistry are computed from 2D connectivity or composition and therefore do not require a specific 3D conformer (e.g., MW, HBA/HBD counts, tPSA, atom counts, fragment-based log P, and 2D fingerprints). In contrast, geometry-, surface-, and shape-based properties are conformation-dependent and should be computed from an explicit 3D structure or a conformational ensemble (e.g., SASA/3D PSA such as MPSA, WHIM, GETAWAY, and 3D-MoRSE descriptors). The 2D representation is often sufficient to calculate most descriptors since these descriptors capture features like atomic constitution, the presence of specific chemical groups, physicochemical properties, and molecular topology. The selection of the most appropriate molecular descriptors usually depends on the type of peptide under investigation and its associated biological activity.58
Descriptors used in large peptide libraries are broadly borrowed from medicinal chemistry, originally designed for small molecules, where they proved useful in predicting drug-likeness and bioavailability. Key examples include tPSA, MW, HBA, and HBD, number of aromatic rings (NAR), the fraction of sp3-hybridized carbon atoms (Fsp3), lipophilicity calculated by the logarithm of 1-octanol/water partition coefficient (log P), number of chiral centers (NCC), and the number of rotatable bonds (NRB).79,84,99,100 Additional descriptors capture structural features, such as secondary structure composition, ionization state, topology, shape, and hydrophilicity.
The importance of these descriptors lies in their connection to bioavailability, i.e., how efficiently a compound can dissolve in aqueous environments and cross biological membranes. Two widely used descriptors are log P and tPSA, which reflect lipophilicity and polar surface area, respectively.82,83 Similarly, intrinsic solubility can be explored through lipophilicity (e.g., log P and the logarithm of the distribution partition coefficient (log D) at pH 7.4), structural constitution (e.g., NAR), and molecular flexibility (e.g., Fsp3, NCC, and NRB).53,79,81,82,99,101
Molecular descriptors associated with the ionization state are informative of aqueous solubility (hydrophilicity) and include the isoelectric point (pI) and the negative logarithm of the acid dissociation constant (pKa). Molecular hydrophobicity (lipophilicity) is usually quantified by calculated log P values, and alternative computational methods, such as XlogP,102 ClogP, and AlogP,103 provide derivative estimates. AlogP has demonstrated superior predictive accuracy for peptides compared with other calculated log P values.104
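The link between pKa values, net charge, and pI can be sketched with the Henderson-Hasselbalch relation. The pKa table below is an assumed, illustrative set (published tables differ by a few tenths of a unit), the function names are hypothetical, and the model ignores sequence context, modified termini, and non-standard residues.

```python
# Assumed illustrative pKa values for termini and ionizable side chains
PKA = {"Nterm": 8.6, "Cterm": 3.6, "K": 10.8, "R": 12.5, "H": 6.5,
       "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
POSITIVE = ("Nterm", "K", "R", "H")      # groups that carry +1 when protonated
NEGATIVE = ("Cterm", "D", "E", "C", "Y")  # groups that carry -1 when deprotonated

def net_charge(seq, ph):
    """Net charge of a linear peptide with free termini at a given pH."""
    counts = {group: seq.count(group) for group in "KRHDECY"}
    counts["Nterm"] = counts["Cterm"] = 1
    charge = 0.0
    for group in POSITIVE:
        charge += counts[group] / (1 + 10 ** (ph - PKA[group]))
    for group in NEGATIVE:
        charge -= counts[group] / (1 + 10 ** (PKA[group] - ph))
    return charge

def isoelectric_point(seq, low=0.0, high=14.0, tol=1e-4):
    """pH at which the net charge is zero, found by bisection
    (the net charge decreases monotonically with pH)."""
    while high - low > tol:
        mid = (low + high) / 2
        if net_charge(seq, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2

basic_pi = isoelectric_point("KKKK")   # hypothetical lysine-rich peptide
acidic_pi = isoelectric_point("DDDD")  # hypothetical aspartate-rich peptide
```

A lysine-rich sequence yields a pI well above neutral pH and an aspartate-rich one a pI in the acidic range, which is the qualitative behavior these ionization descriptors are meant to capture.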
Certain molecular descriptors that capture the shape and topology of peptides can be linked to their stereoselectivity towards molecular receptors. These descriptors reflect the spatial rearrangement and connectivity of the molecules. Examples include Kier's Kappa indices,105 Balaban indices,106 Burden eigenvalues,107 and Randić shape indices.108 Other descriptors focus on molecular complexity. For instance, Basak's Indices provide a numerical representation of structural features such as connectivity, branching, and overall topology.109
Among these shape- and topology-oriented metrics, the Kappa descriptors stand out as topological indices derived from the hydrogen-suppressed molecular graph. They quantify molecular shape and the intricacy of branching by considering paths of different lengths through the structure, comparing the observed branching to an idealized linear or maximally branched reference. Because they are sensitive to the global shape rather than the size of the molecule, Kappa indices offer a refined picture of molecular flexibility and compactness. In QSAR applications, they are particularly useful for highlighting steric and topological features that influence biological activity and receptor interaction.110
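The first two Kappa indices can be computed directly from path counts in the hydrogen-suppressed graph. The sketch below uses the basic definitions, first-order kappa = A(A-1)^2/(P1)^2 and second-order kappa = (A-1)(A-2)^2/(P2)^2, where A is the number of heavy atoms and P1 and P2 are the numbers of one- and two-bond paths; the alpha correction for heteroatoms and the third-order index are omitted, and the function name is an assumption.

```python
def kappa_indices(n_atoms, bonds):
    """First- and second-order Kier shape indices (no alpha correction).
    Atoms are numbered 0..n_atoms-1; bonds is a list of (i, j) pairs.
    Requires at least three atoms so that two-bond paths exist."""
    neighbors = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)
    p1 = len(bonds)
    # Each atom with degree d is the middle of d*(d-1)/2 two-bond paths
    p2 = sum(len(nb) * (len(nb) - 1) // 2 for nb in neighbors.values())
    kappa1 = n_atoms * (n_atoms - 1) ** 2 / p1 ** 2
    kappa2 = (n_atoms - 1) * (n_atoms - 2) ** 2 / p2 ** 2
    return kappa1, kappa2

# Hydrogen-suppressed graphs of two C4 isomers
n_butane = kappa_indices(4, [(0, 1), (1, 2), (2, 3)])
isobutane = kappa_indices(4, [(0, 1), (0, 2), (0, 3)])
```

Both isomers share the same first-order index because they have the same number of atoms and bonds, whereas the second-order index drops from 3 for the linear chain to 4/3 for the branched one, reflecting the more compact shape.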
Another important topological descriptor known for its low degeneracy and strong correlation with physicochemical properties is the Balaban index. It is calculated from the distance matrix of the molecular graph, integrating information from distance sums and the number of edges. By encoding details related to cyclicity, branching, and structural compactness while remaining largely independent of molecular size, the Balaban index excels at distinguishing structural isomers, which is especially valuable in studies involving complex ring systems and detailed structure–property relationships.111
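As a worked example, the Balaban index of an unweighted, connected molecular graph is J = m/(μ + 1) · Σ (s_i · s_j)^(-1/2), summed over bonds, where m is the number of bonds, μ = m - n + 1 is the cyclomatic number, and s_i is the sum of topological distances from atom i to all other atoms. The sketch below implements this basic form only; heteroatom and bond-order corrections are not included, and the function names are assumptions.

```python
from collections import deque
from math import sqrt

def distance_sums(n_atoms, bonds):
    """Sum of shortest-path distances from each atom to all others (BFS)."""
    neighbors = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    sums = []
    for start in range(n_atoms):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        sums.append(sum(dist.values()))
    return sums

def balaban_j(n_atoms, bonds):
    """Balaban J index for a connected, unweighted hydrogen-suppressed graph."""
    s = distance_sums(n_atoms, bonds)
    m = len(bonds)
    mu = m - n_atoms + 1  # cyclomatic number (number of independent rings)
    return m / (mu + 1) * sum(1 / sqrt(s[a] * s[b]) for a, b in bonds)

j_butane = balaban_j(4, [(0, 1), (1, 2), (2, 3)])
j_isobutane = balaban_j(4, [(0, 1), (0, 2), (0, 3)])
j_cyclohexane = balaban_j(6, [(i, (i + 1) % 6) for i in range(6)])
```

The two C4 isomers receive clearly different values (about 1.975 and 2.324), illustrating the low degeneracy noted above.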
To study the geometry of compounds, researchers often rely on 3D molecular descriptors, which require the explicit 3D conformation of a molecule to capture the spatial arrangement of its atoms. These descriptors are widely used in structure–activity relationship (SAR) analyses. Notable examples include WHIM (Weighted Holistic Invariant Molecular descriptors), GETAWAY (GEometry, Topology, and Atom-Weights AssemblY),112 and 3D-MoRSE (3D Molecule Representation of Structures based on Electron diffraction).113
The 3D-MoRSE encodes fundamental 3D atomic coordinates of the molecular structure using a fixed-size vector, drawing on concepts akin to electron diffraction. The 3D-MoRSE descriptors differ from purely topological indices because they incorporate three-dimensional structure along with electronic information. They are generated from simplified theoretical scattering curves—conceptually similar to electron diffraction—using atomic coordinates and weighted properties such as mass or partial charges. The resulting values represent the distribution of electron density across a range of scattering angles. In doing so, the 3D-MoRSE descriptors capture steric effects, electrostatic interactions, and other conformation-dependent features that cannot be inferred from 2D graph topology alone.113
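The scattering-curve idea can be sketched with the commonly quoted form I(s) = Σ_{i>j} w_i · w_j · sin(s · r_ij)/(s · r_ij), where r_ij is the interatomic distance and w_i an atomic weight. The 32 integer scattering values, the unit weights, and the coordinates below are assumptions for illustration; descriptor packages may use different grids and weighting schemes.

```python
from math import dist, sin

def morse_signal(coords, weights, s_values=range(32)):
    """3D-MoRSE-like signal: one value per scattering parameter s.
    coords are (x, y, z) tuples in angstroms; weights are atomic properties."""
    signal = []
    for s in s_values:
        total = 0.0
        for i in range(1, len(coords)):
            for j in range(i):
                x = s * dist(coords[i], coords[j])
                # sin(x)/x tends to 1 as x approaches 0
                total += weights[i] * weights[j] * (sin(x) / x if x else 1.0)
        signal.append(total)
    return signal

# Hypothetical three-atom fragment with unit weights
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.2, 1.3, 0.0)]
weights = [1.0, 1.0, 1.0]
signal = morse_signal(coords, weights)

# The signal depends only on interatomic distances, so a translated copy matches
shifted = [(x + 3.0, y - 1.0, z + 2.0) for x, y, z in coords]
shifted_signal = morse_signal(shifted, weights)
```

Because only pairwise distances enter the sum, the resulting fixed-length vector is unchanged by translation or rotation but changes with conformation.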
WHIM (Weighted Holistic Invariant Molecular) descriptors are a set of 3D molecular descriptors that capture the spatial arrangement of atoms in a molecule.114 The WHIM descriptors91 are also based on 3D atomic coordinates but summarize them through a weighted principal component analysis, which uses a covariance matrix built from atomic positions, with weights that may reflect atomic mass, polarizability, electronegativity, or other physicochemical attributes. Because they derive from molecular invariants, WHIM descriptors remain consistent under translation and rotation of the molecule, providing a global statistical representation of size, shape, symmetry, and atom distribution relative to three orthogonal axes.115
The GETAWAY (GEometry, Topology, and Atom-Weights AssemblY) descriptors complement these approaches by combining 3D molecular geometry, provided by the molecular influence matrix, with atom relatedness derived from molecular topology, under different atomic weighting schemes.112 Together, this family of 3D descriptors enables characterization of steric, electronic, and conformational aspects that strongly influence molecular recognition processes.112
Measures of molecular complexity lack a universal concept, but they have been frequently associated with synthetic accessibility and, in the context of drug design, with the specificity to a molecular target.116 Currently, different descriptors capture various aspects of molecular complexity, such as topological and physicochemical descriptors (e.g., NCC and Fsp3)99,101 and some substructure-based descriptors (number of rings, unsaturations, and heteroatoms), and their complementary use may provide an overview of this molecular property. For peptides, the difficulty of synthesis has been associated with long amino acid chains and with the functional groups present in their side chains.117 A non-exhaustive list of the most applied structure-based descriptors to investigate peptides is presented in Table 1.
| Structure-based descriptors | Peptide information |
|---|---|
| Notes: CC: number of chiral carbons; Fsp3: fraction of sp3-hybridized carbon atoms;101 GETAWAY: GEometry, Topology, and Atom-Weights AssemblY;112 HBA: hydrogen bond acceptors; HBD: hydrogen bond donors; log P: 1-octanol/water partition coefficient; 3D-MoRSE: 3D molecular representations of structure based on electron diffraction; MPSA: molecular 3D polar surface area; MW: molecular weight; MSA: molecular surface area; NCC: number of chiral centers; NPA: number of primary amino groups (–NH2); NHA: number of heavy atoms; NAR: number of aromatic rings; NG: number of guanidine groups; NNCAA: number of negatively charged amino acid groups; NRB: number of rotatable bonds; pKa: negative logarithm of the acid dissociation constant; pI: isoelectric point; tPSA: topological polar surface area;78 PSA: polar surface area; VdW: van der Waals volume; SASA: solvent accessible surface area; XLOGP3: log P estimated from the atom/fragment contribution values;120 WHIM: weighted holistic invariant molecular descriptors.114 | |
| Bond count, heavy atom count, atom type count (e.g., N, O, C, S), NRB | Atomic constitution |
| NPA, NG, NNCAA | Presence of molecular groups |
| MW, NHA, VdW, MSA | Molecular size |
| Log P, log D (pH 7.4), X log P3, A log P, C log P, log Kow | Lipophilicity/hydrophobicity |
| Kappa indices (Kappa1, Kappa2, and Kappa3), Burden eigenvalues, Basak's indices, Balaban index, Wiener index, Randić indices, WHIM indices114 | Molecular shape and topology |
| Fsp3, NRB | Molecular flexibility |
| tPSA, MPSA, PSA | Polar surface (polarity) |
| WHIM indices,114 3D-MoRSE,113 GETAWAY112 | Molecular geometry |
| NCC,99 Fsp3,99,101 CC118 | Molecular complexity |
| HBA, HBD, tPSA, net charge, pKa, log P, X log P, A log P, C log P, log Kow, SASA | Hydrophilicity (aqueous solubility) |
| pKa, pI | Ionization state |
| Number of α-helices, number of β-sheets, number of coils | Secondary structure |
| Eisenberg scale119 | Hydrophobic moment |
Peptides contain both hydrophilic and hydrophobic regions, often influenced by the relative abundance of specific amino acid residues, which in turn shape their molecular mechanisms of action. A notable example is the CPPs, which are typically enriched in lysine and arginine residues. This composition accounts for their cationic or amphipathic nature at physiological pH, as well as their water solubility and cell membrane permeability.132,133 Studies have demonstrated that incorporating arginine into cyclic peptides and protein surfaces enhances cell penetration.134,135 Adapting the amino acid structure of a peptide might influence its biological activity and drug-like properties.5 Cyclization and N-methylation are examples of chemical modifications that shift the peptides' positions within the chemical space, enhancing their potential bioavailability and biological activity.5 For example, a study revealed significant overlaps between the chemical space of synthetic linear and cyclic pentapeptides containing N-methylation and some FDA-approved peptide drugs. Some studies have shown that machine-learning algorithms for predicting peptide classes achieve greater accuracy when their feature sets optimally integrate sequence- and structure-based descriptors than when they rely solely on sequence- or structure-based descriptors.15,136–138
The sequence-based properties have also been used to design novel domains and modulate switchable properties of peptides. Self-assembling peptides (SAPs) are short polypeptide chains that, in an aqueous solution, can spontaneously organize themselves into complex, well-ordered, and stable nano- and meso-structures through the formation of non-covalent interactions,139,140 thus forming versatile building blocks which have been extensively studied to create stimuli-sensitive supramolecular systems.141–143 The amino acid composition of SAPs and the orientation of their amino acids could play a critical role in driving the self-assembling properties. For example, aromatic amino acids in the SAP sequence, such as phenylalanine, tyrosine, and tryptophan, contribute to aggregation mostly through π-stacking, which acts as the main driving force for self-assembly.144 On the other hand, histidine, serine, and threonine have highly polarizable side chains, and peptides containing them could promote aggregation through hydrogen bond formation. Some SAP structures are characterized by their amphiphilicity, meaning their sequences contain hydrophilic and hydrophobic domains that facilitate self-assembly in aqueous solutions, forming non-covalent interactions between amino acid residues.145
Peptide molecules that self-assemble into peptide nanofibers are primarily amphiphilic molecules. These consist of hydrophilic heads containing active peptide segments, hydrophobic tails with alkyl chains, and several amino acids between these two regions, creating enough space to prevent spontaneous aggregation after introducing negative charges.146,147 Amphipathic peptides are more likely to self-assemble into amyloid-like β-sheet fibrils when their primary sequence shows a pattern of alternating hydrophobic and hydrophilic amino acids. These fibrils form a bilayer structure comprising two β-sheets that align to conceal the hydrophobic side chains within the bilayer's interior. In contrast, the hydrophilic side chains remain exposed on the surface of the bilayer.148,149 Recently, a study demonstrated that the SAP sequence significantly influences structural sensitivity to supramolecular polymerization pathways, affecting the resulting polymers' structural and functional properties.150 Yuan et al. (2022) demonstrated that the order of amino acids in the sequences AAEE and AEAE (A and E represent alanine and glutamic acid, respectively) impacts the driving forces involved in peptide polymerization, which directly correlates with mechanical properties and bioactivity.150
Some computational models have been developed using a combination of sequence-based and structure-based descriptors to predict the bioactivity of peptides. These models have improved performance compared to algorithms relying solely on one class of molecular descriptors.15,94,124,136,137 For example, Rajput et al. (2015) analyzed the QSPs according to their amino acid composition, residue position, physicochemical properties, and sequence motifs. They identified that some aromatic residues, such as tryptophan, tyrosine, and phenylalanine, play an important role in their characterization, as do positional preferences of residues, such as serine at the N-terminal end and phenylalanine at the C-terminal end, so these sequence-based properties could be used for their identification.124 Physicochemical properties, such as aromaticity, molecular weight, and secondary structure, contribute to QSP identification.124 Recent approaches utilize propensity score representation learning to extract and combine the propensities of amino acids and dipeptides.151
Sequence-based descriptors provide a quantitative framework for analyzing peptides, capturing features such as amino acid composition, positional distribution, and sequence arrangement patterns. These descriptors enable the application of statistical and machine learning approaches to uncover structure–activity/property relationships in peptide research.127,152–154 In this context, we focus on molecular descriptors and scoring matrices that support large-scale, data-driven analyses. Several Python libraries, widely used for big data applications, offer built-in tools for calculating such descriptors, such as BioPython,155 RDKit,21 Mordred,24 and PyBioMed.156 Additionally, there are tools focused on peptide analysis, such as PepFun,157 iFeature,23 iFeatureOmega,22 peptides.py (https://pypi.org/project/peptides/),158 and PepFuNN.159
Some sequence-based descriptors evaluate the amino acid or k-mer constitution of a sequence, providing information about the relative abundance or scarcity of amino acids. Examples include the amino acid composition (AAC),160 dipeptide composition (DPC),161 tripeptide composition (TPC),152 and terminus composition (TC).162 k-mers are substrings created by moving a window of length k along the sequence at a set interval. In addition to the constitution, these descriptors reflect the overall frequency of the amino acids or k-mers in the sequence.
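A minimal Python sketch of k-mer composition is given below for illustration. It assumes a window step of one residue and the 20 standard amino acids; the function name `kmer_composition` and the example peptide are hypothetical and not taken from any of the cited libraries. With k = 1 the output corresponds to AAC (20 features), and with k = 2 to DPC (400 features).

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues


def kmer_composition(sequence, k=1):
    """Relative frequency of every possible k-mer, using a sliding window of step 1."""
    if len(sequence) < k:
        raise ValueError("sequence is shorter than k")
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    # Enumerate all 20**k possible k-mers so that the vector has a fixed length
    return {"".join(p): counts["".join(p)] / total
            for p in product(AMINO_ACIDS, repeat=k)}


peptide = "GLFDIVKKVV"  # hypothetical example sequence
aac = kmer_composition(peptide, k=1)  # amino acid composition
dpc = kmer_composition(peptide, k=2)  # dipeptide composition
```

The fixed-length dictionaries can be converted to feature vectors by reading the values in a consistent key order, which is what makes peptides of different lengths comparable.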
Other sequence-based encoders offer information about group- and gap-based amino acid rearrangements. The group-based amino acid descriptors aim to mitigate the high-dimensional data derived from the existence of 20 amino acids, so this class of encoders groups or reduces the amino acid compositions to investigate the peptide sequences. High-dimensional data can lead to overfitting, compromising the prediction accuracy of the models when the number of features exceeds the number of independent samples.163 Thus, these descriptors extract characteristics that better reflect the relationships of groups of amino acid residues in the sequence. The group-based amino acid composition descriptors, for example, include the grouped tripeptide composition (GTPC),164 grouped dipeptide composition (GDPC),164 pseudo k-tuple reduced amino acid composition (PseKRAAC),165 and the grouped amino acid composition (GAAC).23 In contrast, the gap-based amino acid descriptors create bi-mers from peptide sequences using various gap sizes, and subsequently analyze the distribution of the resulting gap-based bi-mers. These sequence-based descriptors include composition of k-spaced amino acid pairs (CKSAAP),166 and adaptive skip dipeptide composition (ASDC).167,168
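The two encoder families can be sketched in a few lines of Python. The example below is illustrative only: the five groups follow the GAAC definition listed in Table 2, while the function names `gaac` and `cksaap`, the group labels, and the example peptide are assumptions made for this sketch rather than the API of iFeature or any other tool.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Five physicochemical groups, as listed for GAAC in Table 2
GROUPS = {
    "aliphatic": "AGILMV",
    "aromatic": "FWY",
    "positive": "HKR",
    "negative": "DE",
    "uncharged": "CNPQST",
}


def gaac(sequence):
    """Grouped amino acid composition: 5 features instead of 20."""
    counts = Counter(sequence)
    n = len(sequence)
    return {name: sum(counts[aa] for aa in members) / n
            for name, members in GROUPS.items()}


def cksaap(sequence, k):
    """Frequencies of the 400 residue pairs separated by k intervening residues."""
    pairs = [sequence[i] + sequence[i + k + 1]
             for i in range(len(sequence) - k - 1)]
    counts = Counter(pairs)
    total = len(pairs)
    return {a + b: (counts[a + b] / total if total else 0.0)
            for a, b in product(AMINO_ACIDS, repeat=2)}


peptide = "GLFDIVKKVV"  # hypothetical example sequence
grouped = gaac(peptide)
gapped = cksaap(peptide, k=1)  # pairs of the form X?Y with one residue in between
```

Grouping reduces the dimensionality of the feature space, whereas the gap-based encoder keeps 400 features per gap size and therefore grows with the number of gaps considered.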
In addition to these descriptors, some libraries extract from the sequence the AAindex, a curated, literature-derived database that compiles numerical indices describing physicochemical properties of amino acids. In its core component (AAindex1), each “amino acid index” represents a single property as a set of 20 numerical values, one per standard amino acid, enabling sequences to be converted into quantitative property profiles.169
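As an illustration of how a single amino acid index turns a sequence into a property profile, the sketch below uses the Kyte–Doolittle hydropathy scale as the example index. The function name `property_profile` and the example peptide are hypothetical; averaging the profile gives the GRAVY value referred to in Table 2.

```python
# Kyte-Doolittle hydropathy values: one number per standard amino acid,
# i.e., the 20-value layout of a single amino acid index
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def property_profile(sequence, index):
    """Map each residue to its value in the chosen amino acid index."""
    return [index[aa] for aa in sequence]


peptide = "GLFDIVKKVV"  # hypothetical example sequence
profile = property_profile(peptide, KYTE_DOOLITTLE)

# Averaging the hydropathy profile yields the GRAVY score of the sequence
gravy = sum(profile) / len(profile)
```

Any other 20-value index can be substituted for the dictionary without changing the rest of the code, which is the practical appeal of this representation.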
Several substitution and scoring matrices have also been developed to represent the variability, the physicochemical properties, and substitution patterns of polypeptide sequences, including position-specific scoring matrix (PSSM), residue pairwise energy content matrix (RECM),170 Z-scale,171 and BLOcks Substitution Matrix (BLOSUM).172 The BLOSUM and PAM matrices, for example, are substitution matrices derived from sequence alignments, and both are commonly used as encoders to characterize peptide sequences based on their evolutionary substitution profiles (Table 3),173 showing variations depending on the identities of the pre-computed datasets. For example, BLOSUM matrices come in different versions, such as BLOSUM50, BLOSUM62, and BLOSUM80, created using the amino acid substitution frequencies observed in aligned sequences at different identity thresholds. The 62% identity threshold (BLOSUM62) is widely used for peptide and protein sequence characterization.174,175 In contrast, the PSSM, RECM, and z-scale171 are classified as scoring matrices applied for amino acid sequences (see Table 3). The z-scale, for example, is an amino acid descriptor set used to numerically represent the physicochemical, hydrophobic, and polar properties of amino acids in protein or peptide sequences. This matrix is derived from a PCA of various physicochemical properties of amino acids, reducing them into a few orthogonal components.171
While the secondary structure content depends on the conformation of the peptide and is more accurately calculated from the three-dimensional structure, several sequence-based prediction methods have been developed that show promising results for predicting secondary structure features and classifying oligopeptide sequences.176–178 For example, Zhang et al. (2011) developed a transition probability matrix to represent secondary structures,178 and Dai et al. (2013) introduced a statistical position-based feature of secondary structural elements to predict the structural classes of oligopeptide sequences.177 The secondary structure elements content (SSEC), for instance, is a molecular descriptor calculated from the secondary structure predicted from the primary structure by PSIPRED v4.0, and it provides the content of three types of secondary structure elements.23
The correlation encoder quantifies the relationship between amino acids by calculating correlation coefficients that reflect differences in molecular descriptors related to hydrophobicity, hydrophilicity, mass, shape, topology, constitution, etc. These descriptors reveal how specific properties of amino acids are interrelated along the sequence. Moran,179 Normalized Moreau-Broto, and Geary179 are autocorrelation descriptors that use eight amino acid indices by default for peptide sequences, as follows: DAYM780201 represents the residue substitution profile, CHOC760101 represents the residue accessible surface area in a tripeptide, CIDH920105 represents the normalized average hydrophobicity scales, BHAR880101 represents the average flexibility indices, CHAM820101 represents the polarizability parameter, CHAM820102 represents the free energy in water, BIGC670101 represents the volume of the residue, and CHAM810101 represents the steric parameter.23,180
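The three autocorrelation descriptors can be sketched as follows. This is a minimal illustration that assumes the standard textbook definitions of the Moran, Geary, and normalized Moreau-Broto autocorrelations for a property profile P of length N at lag d; the function names and the toy profile are hypothetical, and library implementations may additionally standardize each amino acid index before use.

```python
def _check_lag(profile, d):
    if not 1 <= d < len(profile):
        raise ValueError("lag d must satisfy 1 <= d < len(profile)")


def moran(profile, d):
    """Moran autocorrelation I(d) of a per-residue property profile."""
    _check_lag(profile, d)
    n = len(profile)
    mean = sum(profile) / n
    num = sum((profile[i] - mean) * (profile[i + d] - mean)
              for i in range(n - d)) / (n - d)
    den = sum((p - mean) ** 2 for p in profile) / n
    return num / den


def geary(profile, d):
    """Geary autocorrelation C(d) of a per-residue property profile."""
    _check_lag(profile, d)
    n = len(profile)
    mean = sum(profile) / n
    num = sum((profile[i] - profile[i + d]) ** 2
              for i in range(n - d)) / (2 * (n - d))
    den = sum((p - mean) ** 2 for p in profile) / (n - 1)
    return num / den


def moreau_broto(profile, d, normalized=True):
    """Moreau-Broto autocorrelation AC(d), or ATS(d) = AC(d)/(N - d) if normalized."""
    _check_lag(profile, d)
    n = len(profile)
    ac = sum(profile[i] * profile[i + d] for i in range(n - d))
    return ac / (n - d) if normalized else ac


# Toy property profile (one value per residue); real use would map residues
# through one of the amino acid indices listed above
toy_profile = [1.0, 2.0, 3.0, 4.0]
features = [f(toy_profile, 1) for f in (moran, geary, moreau_broto)]
```

In practice, each descriptor is evaluated for every lag from 1 up to the chosen maximum and for every amino acid index, so the feature count is the number of indices multiplied by the number of lags.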
Binary encoders are descriptors that transform amino acid sequences into statistical vectors, with each amino acid encoded as a 20-dimensional binary vector consisting of 0s and 1s. The binary representation is available in 3-, 5-, 6-, and 20-bit variants, which represent groups of amino acids in the sequence according to their physicochemical properties.181 For example, the 6-bit binary encoding uses six amino acid groups {e1, e2, e3, e4, e5, e6} to encode the oligopeptide sequence, where e1 ∈ {H, R, K}, e2 ∈ {D, E, N}, e3 = C, e4 ∈ {S, T, P, A, G}, e5 ∈ {M, I, L, V}, e6 ∈ {F, Y, W}. These groups capture conservative substitutions that can occur over evolutionary time. They function as equivalence classes grouping amino acids by similarity, and their definitions are based on PAM-based relationships. Then, each group is represented by a 6-dimensional binary vector, e.g., e1 is encoded by (100000), e2 is encoded by (010000), and so on.22 In the sparse encoding approach, each peptide sequence is mapped to a fixed-length vector of 100 positions, corresponding to the maximum sequence length stored in the database. A reference list containing the 20 standard amino acids plus one additional symbol for gaps or empty positions is used. Each amino acid is converted into a one-hot vector of length 21, where a single element indicating its position in the list is set to “1”, and the remaining elements are set to “0”. Consequently, every position in the 100-length sequence corresponds to a 21-dimensional vector. This representation ensures that each amino acid is uniquely identified by its position within the encoding space.181
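The sparse encoding described above can be sketched as follows. The ordering of the 21-symbol reference list, the use of "-" as the gap symbol, and the function name `sparse_encode` are assumptions made for this illustration; only the 100-position length and the 21-dimensional one-hot vectors come from the description.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 standard residues plus a gap symbol (assumed order)
MAX_LEN = 100  # fixed number of positions


def sparse_encode(sequence, max_len=MAX_LEN):
    """Return a max_len x 21 one-hot matrix; unused positions are encoded as gaps."""
    if len(sequence) > max_len:
        raise ValueError("sequence exceeds the fixed encoding length")
    padded = sequence.ljust(max_len, "-")
    matrix = []
    for residue in padded:
        row = [0] * len(ALPHABET)
        row[ALPHABET.index(residue)] = 1  # exactly one element set to 1
        matrix.append(row)
    return matrix


peptide = "GLFDIVKKVV"  # hypothetical example sequence
encoded = sparse_encode(peptide)
```

Flattening the matrix gives a vector of 100 × 21 = 2100 binary features, most of which are zero for short peptides, hence the name sparse encoding.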
A list of molecular descriptors derived from the sequence is described in Table 2. A list of applied scoring and substitution matrices is described in Table 3. A list of autocorrelation descriptors associated with the amino acid indices is presented in Table 4.
| Sequence-based descriptors | Peptide information |
|---|---|
| Notes: (a) GRAVY: grand average of hydropathy, the value of the hydropathic index calculated by the Kyte–Doolittle method from the peptide sequence. (b) FLEX index: the structural flexibility calculated from the peptide sequence according to Vihinen et al. (1994). | |
| Amino acid composition (AAC)160 | Frequencies of the 20 types of native amino acids present over the peptide sequence |
| Pseudo-amino acid composition (PseAAC)182 | Frequencies of the discrete sequence correlation factors and the twenty components of the conventional amino acid composition |
| Amphiphilic pseudo-amino acid composition (APAAC) | Frequencies of the discrete sequence correlation factors related to the hydrophobicity and hydrophilicity |
| Dipeptide composition (DPC)161 | Frequencies of 400 types of dipeptides present over the sequence |
| Tripeptide composition (TPC)152 | Frequencies of 8000 types of tripeptides present over the sequence |
| Grouped amino acid composition (GAAC)23 | Frequencies of five groups of amino acids based on their physicochemical properties: negative charge (D, E), positive charge (H, R, K), aromatic group (F, Y, W), aliphatic group (A, G, I, L, M, V), and uncharged (C, N, P, Q, S, T). |
| Terminus composition (TC)162 | Frequencies of amino acids and dipeptides for 5, 10, and 15 residues present at the N- and C-terminus of the peptide sequence. |
| Composition of k-spaced amino acid pairs (CKSAAP)166 | Frequencies of 400 types of residue pairs separated by k other amino acids (k = 1, 2, 3) within a sequence or sequence fragment. |
| CTDT (composition/transition/distribution) | Distribution of amino acid composition patterns linked to specific chemical, physical, or structural properties within the peptide sequence. The composition (C) refers to the amino acid composition in sequence, the transition (T) corresponds to changes among three patterns: neutral, hydrophobic, and polar, and the distribution (D) refers to the pattern of distribution of these properties over the sequence. |
| Pseudo K-tuple reduced amino acids composition (PseKRAAC)165 | Frequencies of the 16 types of reduced K-tuple pseudo amino acids calculated from the sequence-order information for all dipeptides and the correlation between nth nearest residue. |
| Adaptive skip dipeptide composition (ASDC)167 | Frequencies of amino acid pairs separated by a variable (adaptive) number of intervening residues. |
| Quasi-sequence-order descriptors (QSOrder)183 | Frequencies of the amino acid sequence orders calculated using the sequence-order-coupling numbers that reflect the interactions between amino acids at various ranks of proximity. The coupling factor used to calculate these numbers is based on the physicochemical distance between amino acids, which considers properties like hydrophobicity, hydrophilicity, side-chain volume, and polarity. |
| Secondary structure elements content (SSEC) | Number of α-helices, β-sheets, and coils |
| Shannon information entropy | Scoring value that measures the degree of variability at a specific amino acid position in a multiple sequence alignment |
| AAindex169 | Compilation of literature-reported scales that quantify physicochemical tendencies of the standard amino acids. Each scale corresponds to one property and is encoded as a 20-element numeric vector, assigning a specific value to each amino acid |
| GRAVY index184 (a) | Hydropathic character |
| FLEX index185 (b) | Structural flexibility |
| Scoring and substitution matrices | Peptide information |
|---|---|
| Note: (a) Different identity thresholds can be applied in BLOSUM to characterize peptide sequences, with 62% typically used in most alignments. | |
| Position-specific scoring matrix (PSSM)186 | Scoring matrix containing the likelihood of each amino acid at a specific position in a peptide sequence. It is derived from multiple sequence alignments, aiding the identification of conserved regions. |
| Residue pairwise energy content matrix (RECM)170 | Scoring substitution 20 × 20 matrix containing residue pairwise energy for 20 standard amino acids derived from the primary structure of 674 proteins. |
| BLOcks Substitution Matrices (BLOSUM)172 (a) | Substitution 20 × 20 matrix based on observed substitutions in conserved blocks, with a threshold of identity (a). |
| Grantham distance matrix187 | Scoring substitution 20 × 20 matrix that incorporates residue substitution frequencies that better correspond to the overall chemical differences, including composition, polarity, and molecular volume. |
| Point accepted mutation (PAM) matrices (also named Dayhoff matrices) | Substitution 20 × 20 matrix where each entry in the matrix represents the likelihood of one amino acid being replaced by another through accepted mutations over a specified evolutionary period |
| z-Scale171 | Scoring 87 × 26 matrix applied for amino acid sequences, where the 87 rows correspond to the different amino acids (including 20 standard amino acids plus many non-coded or unusual ones) and 26 columns correspond to different physicochemical descriptor scores. |
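| Autocorrelation descriptors | Equations |
|---|---|
| Notes: (a) $I(d)$ is the Moran autocorrelation, $d$ is the lag of the autocorrelation, nlag is the maximum value of the lag (default value: 30), and $P_i$ and $P_{i+d}$ are the properties of the amino acids at positions $i$ and $i + d$, respectively. $\bar{P}$ is the average of the considered property $P$ over the entire sequence of length $N$. (b) $C(d)$ is the Geary autocorrelation; $d$, $P$, $P_i$, $P_{i+d}$, nlag, and $N$ have the same definitions as for Moran. (c) $ATS(d)$ is the normalized Moreau-Broto autocorrelation and $AC(d)$ is the Moreau-Broto autocorrelation; $d$, $P$, $P_i$, $P_{i+d}$, nlag, and $N$ have the same definitions as for Moran. The equation images were lost in extraction; the expressions below are the standard forms of these descriptors, written with the symbols defined in these notes. | |
| Moran (a) | $I(d) = \dfrac{\frac{1}{N-d}\sum_{i=1}^{N-d}\left(P_i-\bar{P}\right)\left(P_{i+d}-\bar{P}\right)}{\frac{1}{N}\sum_{i=1}^{N}\left(P_i-\bar{P}\right)^{2}}$, with $d = 1, 2, \ldots, \mathrm{nlag}$ and $\bar{P} = \frac{1}{N}\sum_{i=1}^{N} P_i$ |
| Geary (b) | $C(d) = \dfrac{\frac{1}{2(N-d)}\sum_{i=1}^{N-d}\left(P_i-P_{i+d}\right)^{2}}{\frac{1}{N-1}\sum_{i=1}^{N}\left(P_i-\bar{P}\right)^{2}}$, with $d = 1, 2, \ldots, \mathrm{nlag}$ |
| Normalized Moreau-Broto autocorrelation (c) | $ATS(d) = \dfrac{AC(d)}{N-d}$, with $AC(d) = \sum_{i=1}^{N-d} P_i P_{i+d}$ and $d = 1, 2, \ldots, \mathrm{nlag}$ |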
Molecular fingerprints provide a cost-efficient computational method for analyzing large compound libraries, due to their compact representation of complex molecular structures,6,192 which justifies their integration with computationally demanding virtual screening techniques.2,193 Molecular fingerprints serve as representations of a chemical structure that encode the presence or the absence of a particular molecular feature.193,194 These types of molecular representation are essential for analyzing large chemical libraries and comparing their structures using quantitative assessment of pairwise similarity.194 Currently, six main categories of molecular fingerprints are used to describe molecules: (1) descriptor-based, (2) substructure-based, (3) pharmacophore-based, (4) path-based (or hashed), (5) string-based, and (6) circular fingerprints.195
Descriptor-based fingerprints use molecular features derived from physicochemical properties, such as the van der Waals surface area (VSA) fingerprint. The substructure-based fingerprints are used to identify the presence of specific substructures, including functional groups and rings of certain sizes. This class includes the MACCS (Molecular ACCess System) key fingerprint.196 The pharmacophore fingerprints encode the pharmacophore groups present in molecules, and this class characterizes the interaction of the molecules with the protein environment. Belonging to this class is the MXFP, an atom-pair fingerprint that describes molecular shape and pharmacophores.192 The path-based (or hashed) fingerprints identify all types of subgraphs, including linear subgraphs representing the shortest paths between atom pairs and circular subgraphs that capture the neighborhoods of bonded atoms, hashing them inside a fixed-size vector. Atom-pair fingerprints are a subclass of path-based fingerprints that describe a molecule by analyzing all possible triplets formed by two atoms and the shortest path that connects them.197 These fingerprints include the E3FP,198 ECFP, and MAP4.6 The string-based fingerprints create molecular representations by analyzing the SMILES string of a compound rather than its graphical representation. Finally, the circular fingerprints decompose the analyzed compound into various fragments, similar to substructure-based fingerprints. However, instead of depending on predefined structural patterns, they dynamically generate these fragments from the molecular graph of each compound.
Currently, most virtual screening strategies or chemical space mapping of compounds applied in drug discovery use the MACCS key fingerprint, Morgan fingerprint – commonly referred to as the ECFP fingerprint,199 and MinHashed fingerprint MHFP6.200 Nevertheless, these molecular fingerprints usually struggle to accurately capture the overall characteristics of molecules, including their size and shape. Additionally, they are inadequate at recognizing structural variations that could be significant in larger molecules, such as distinguishing between linkers of varying lengths, identifying scrambled peptide sequences with the same amino acid composition and sequence length, or differentiating between regioisomers.52
A pharmacophore-based fingerprint derived from the 2D structure of peptides, termed 2DP, was developed to encode the molecular shape and pharmacophore properties of peptides. This fingerprint represents the peptides’ topology as a graph where nodes correspond to α-carbon atoms and edges represent bonds between them. This fingerprint captures key molecular features, including the number of hydrophobic, positively charged, negatively charged, and total non-hydrogen atoms in each residue. Distances between atom pairs are calculated along the shortest path in the peptide's topology, and Gaussian functions centered on these distances are used to generate a 136-dimensional chemical space. This fingerprinting method enables the exploration of peptides with unknown or flexible 3D structures, making it particularly suited for studying unconventional topologies like bicyclic peptides.201
It has been demonstrated that some 2D fingerprints can effectively distinguish between peptide-like molecules with varying degrees of biological activity. Eckert and Bajorath (2007) found that MolPrint2D performed best in recovering active molecules with strong peptide character. However, the property descriptor-based fingerprint excelled in identifying compounds with lower peptide character, indicating its utility in transitioning from peptide-like compounds to non-peptide alternatives.45 Capecchi et al. (2020) developed MAP4, which represents the relationships between pairs of atoms in a molecule, considering their types and the topological distance. This fingerprint was designed to handle large and complex molecules, such as peptides, proteins, and peptide-like compounds, while maintaining computational efficiency.52 Recently, Capecchi and Reymond (2021) used a genetic algorithm with the molecular fingerprint MAP4 to represent the chemical space of peptides, organizing them by sequence and size. The chemical space represents 40 531 peptides from eleven open-access peptide and peptide-containing databases, and the map obtained categorizes the peptides by activity type, indicating that a large fraction of the peptides in the investigated databases, comprising 17 260 sequences, or 43% of the total, are classified as antimicrobial and anticancer.6 The Reymond group also developed MAP4C, a chiral adaptation of the MAP4 fingerprint, to analyze the stereochemical properties of large molecules, such as peptides. This fingerprint generates MinHashes derived from character strings encoding the SMILES representations of all pairs of circular substructures with diameters of up to four bonds and the shortest topological distance between their central atoms. The MAP4C incorporates Cahn–Ingold–Prelog (CIP) annotations (R, S, r, or s) for chiral atoms at the center of circular substructures, uses a question mark for undefined stereocenters, and includes cis–trans information for double bonds when specified. In non-stereoselective virtual screening approaches, MAP4C performs slightly better than the achiral MAP4, ECFP, and AP fingerprints.202 To evaluate the chemical space of antimicrobial peptides (AMPs) and identify new candidates with therapeutic potential, Orsi et al. (2024)190 integrated cheminformatics, ligand-based virtual screening, and machine-learning techniques. Virtual peptide libraries, including bicyclic and dendritic structures, were constructed and analyzed using molecular fingerprints, such as MAP4 and its chiral variant, MAP4C. These fingerprints measure molecular similarities and facilitate the visualization of the chemical space through dimensionality reduction methods, including PCA. The ligand-based virtual screening was employed to prioritize AMP candidates based on their similarity or diversity to known bioactive molecules, significantly enhancing the efficiency and success rate compared to random selection. Furthermore, machine-learning models, such as support vector machines and recurrent neural networks, were trained on experimental AMP datasets to predict antimicrobial activity and toxicity, aiding in identifying promising peptides for experimental validation.190
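The MinHashing step used by MAP4-type fingerprints can be illustrated with a toy Python sketch. This is not the MAP4 or MAP4C implementation: the shingle strings are made up to mimic a "substructure|distance|substructure" layout, and the seeded SHA-1 hashing, the signature length of 256, and all function names are assumptions chosen for a self-contained example. The sketch only shows how a set of string shingles becomes a fixed-length signature whose agreement between two molecules approximates the Jaccard similarity of their shingle sets.

```python
import hashlib


def _seeded_hash(shingle, seed):
    """64-bit hash of a shingle string; each seed acts as one hash function."""
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(shingles, n_hashes=256):
    """For each seeded hash function, keep the minimum value over all shingles."""
    if not shingles:
        raise ValueError("at least one shingle is required")
    return [min(_seeded_hash(s, seed) for s in shingles)
            for seed in range(n_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(set_a, set_b):
    return len(set_a & set_b) / len(set_a | set_b)


# Made-up shingle sets standing in for two molecules
mol_a = {"CC|1|CN", "CN|2|C=O", "C=O|3|CO", "CO|1|CC"}
mol_b = {"CC|1|CN", "CN|2|C=O", "C=O|3|CS", "CS|1|CC"}

sig_a = minhash_signature(mol_a)
sig_b = minhash_signature(mol_b)
approx = estimated_jaccard(sig_a, sig_b)
exact = exact_jaccard(mol_a, mol_b)
```

The signature length is independent of molecule size, which is one reason MinHashed fingerprints remain tractable for large molecules such as peptides.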
Some molecular fingerprints applied for peptide analyses are described in Table 5. We focused on the most commonly used fingerprints for the analysis of peptides, which are usually accessible in widely used C++, Java, and Python libraries, such as Scikit-fingerprints,203 RDKit,21 iFeatureOmega,22 and Open Babel.204
| Molecular fingerprint | Description | Category | Implemented libraries or websites |
|---|---|---|---|
| MinHashed atom-pair up to a diameter of four bonds fingerprint (MAP4)52 | The circular substructures (radii r = 1 and r = 2) around each atom in an atom pair are represented as two SMILES pairs linked by the topological distance between the central atoms. These atom-pair molecular shingles are hashed and then undergo MinHashing. | Circular and path-based fingerprint (subclass atom-pair) | scikit-fingerprints and GitHub (https://github.com/reymond-group/map4) |
| Chiral MAP4 (MAP4C)6 | Chiral representation of the MAP4 fingerprint | String-based and path-based fingerprint | GitHub (https://github.com/reymond-group/mapchiral) |
| Molecular ACCess System (MACCS Key) | Consists of a fixed-length bit vector, typically 166 bits. Encodes molecules' predefined substructures or functional groups, such as rings, bonds, and specific atom types. | Substructure fingerprint | RDKit, OpenBabel, and scikit-fingerprints |
| Extended Connectivity Fingerprint 6/4 (ECFP6, ECFP4) | Encodes the environment of each atom circularly, capturing information about the atom and its neighboring atoms up to a specified radius. | Path-based fingerprint | RDKit, scikit-fingerprints, OpenBabel, and iFeatureOmega |
| Extended-connectivity count fingerprint (ECFC6) | A variant of ECFPs that not only indicates the presence of specific substructures but also counts the occurrences of each substructure within the molecule. The “6” refers to the maximum diameter considered during the fingerprint generation. | Path-based fingerprint | RDKit, scikit-fingerprints, and OpenBabel |
| DompeKeys205 | Set of substructure-based fingerprint descriptors designed to encode patterns of functional groups and chemical features within molecular structures. | Substructure fingerprint | Developer website (https://dompekeys.exscalate.eu) |
| MolPrint2D206 | Encodes molecular structures by representing the atom environment up to a specific distance. It generates exhaustive lists of substructures surrounding each atom, which are then indexed for similarity comparison. | Circular fingerprint | OpenBabel |
| Macromolecule eXtended FingerPrint (MXFP)192 | A 217-dimensional fuzzy fingerprint representing atom pairs from seven pharmacophore groups, which is ideal for comparing large molecules and facilitating scaffold hopping. | Pharmacophore fingerprint | Developer GitHub (https://github.com/markusorsi/mxfp_python) |
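The MinHashing step listed for MAP4 in Table 5 can be illustrated with a minimal, standard-library sketch. It is not the reference MAP4 implementation: the shingle strings, the helper names (`shingle_hash`, `minhash_signature`, `estimated_jaccard`), and the use of seeded BLAKE2b as the hash family are all illustrative assumptions. The sketch only shows how a set of hashed shingles is compressed into a fixed-length signature whose position-wise agreement approximates the Jaccard similarity of the original sets.

```python
import hashlib
from typing import List, Set


def shingle_hash(shingle: str, seed: int) -> int:
    """Seeded 64-bit hash of one shingle (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingles: Set[str], n_hashes: int = 256) -> List[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    return [min(shingle_hash(s, seed) for s in shingles) for seed in range(n_hashes)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature positions estimates the Jaccard index."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(set_a: Set[str], set_b: Set[str]) -> float:
    """Exact Jaccard index, used here only to check the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical atom-pair shingles written as "environment|distance|environment".
peptide_a = {"CN|1|CC", "CC|2|C=O", "C=O|3|CN", "CN|4|CO"}
peptide_b = {"CN|1|CC", "CC|2|C=O", "C=O|3|CS", "CS|4|CO"}

estimate = estimated_jaccard(minhash_signature(peptide_a), minhash_signature(peptide_b))
print(f"exact={exact_jaccard(peptide_a, peptide_b):.3f} estimated={estimate:.3f}")
```

Longer signatures reduce the variance of the estimate at the cost of more hashing per molecule.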
In peptide modeling, embeddings are generated through neural or deep-learning encoders trained on sequence and/or structural data. These encoders learn internal representations that capture regularities in biochemical composition, residue interactions, structural motifs, and context-dependent effects.207,209 The resulting latent space reflects patterns discovered from the data and organizes peptides according to shared properties and structural similarity.214,215 A well-trained embedding preserves chemically relevant information while structuring peptides in a way that facilitates similarity analysis, interpolation, clustering, screening, and predictive modeling.210,216 A schematic overview of an embedding-based workflow applied to peptide sciences is shown in Fig. 6.
The choice of peptide representation is especially important because these molecules can be encoded at different hierarchical levels. A growing body of peptide cheminformatics literature emphasizes that amino acid-based representations often reflect the functional building blocks of peptides better than purely atom-level descriptions, since peptide activity is frequently driven by residue identity, order, and context.131 Accordingly, residue-level notations such as FASTA,128 PLN,217 HELM,130 and BILN129 are highly relevant for embedding workflows. FASTA provides a simple encoding for canonical peptide sequences, whereas PLN extends the representation to a broader range of modified peptides. HELM offers a richer formalism for complex biomolecules, including cyclic, branched, and crosslinked peptides, and BILN improves the human readability of HELM-derived peptide descriptions.128,130
A good encoding scheme preserves the relevant chemical content while producing inputs that are compatible with modern deep-learning pipelines.218 For residue-based representations, the usual workflow begins by tokenizing the sequence into amino acids or modified monomers, followed by conversion into machine-readable vectors. In the simplest case, one-hot encoding represents each residue by a sparse binary vector, typically defined over the alphabet of canonical amino acids, although the vocabulary can be expanded to include non-canonical residues and common chemical modifications. More informative sequence encoders instead map each token to a dense learned vector, allowing the model to capture contextual dependencies, long-range interactions, and position-dependent effects across the peptide chain. In parallel, some peptide-focused pipelines use property-informed residue encodings, in which each amino acid is represented by physicochemical descriptors such as hydrophobicity, charge, steric parameters, or polarity-related indices, thereby injecting biochemical priors that can be especially helpful when data are limited or when interpretability is desired.
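The two residue-level encodings described above can be sketched in a few lines. This is a minimal illustration, not a recommended feature set: the alphabet ordering, the helper names, and the choice of two properties are assumptions. The hydropathy values follow the Kyte-Doolittle scale, and the integer charges assume side chains at roughly neutral pH with histidine treated as neutral.

```python
from typing import Dict, List

# Canonical amino acid alphabet; the ordering is an arbitrary assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX: Dict[str, int] = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Kyte-Doolittle hydropathy values.
HYDROPATHY: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplified side-chain charge near neutral pH (assumption: His is neutral).
CHARGE: Dict[str, int] = {"K": 1, "R": 1, "D": -1, "E": -1}


def one_hot(sequence: str) -> List[List[int]]:
    """Encode a peptide of L residues as an L x 20 sparse binary matrix."""
    matrix = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[aa]] = 1
        matrix.append(row)
    return matrix


def property_encode(sequence: str) -> List[List[float]]:
    """Encode each residue as [hydropathy, charge] instead of an identity bit."""
    return [[HYDROPATHY[aa], CHARGE.get(aa, 0)] for aa in sequence]


peptide = "ACK"  # hypothetical toy sequence
print(one_hot(peptide))
print(property_encode(peptide))
```

Extending the vocabulary to non-canonical residues amounts to adding symbols to `AMINO_ACIDS` and entries to the property tables, which is where one-hot encodings become sparse and property-informed encodings become harder to parameterize.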
For atom-based molecular representations formed by strings of characters, such as tokenized SMILES, peptide/biopolymer notations like CHUCKLES (monomer-sequence SMILES translation), or robust grammars such as SELFIES, a typical workflow starts by converting the string into tokens and then mapping those tokens to numerical vectors suitable for neural networks.3 These tokens are processed by neural architectures that can capture contextual dependencies along the sequence, such as long-range residue interactions and position-dependent effects. Through training, the model internalizes statistical regularities of sequence composition and structural tendencies, yielding embeddings that reflect both local and global sequence organization (Fig. 7, panel A). In one-hot encoding, each symbol of the alphabet is represented by a sparse binary vector of length N with a single “1” at the index of that symbol and “0” elsewhere, where N is the alphabet size (often the 20 canonical amino acids, although it can be expanded to include non-canonical residues and common chemical modifications); a peptide of L residues is therefore encoded as an L × N matrix. In parallel, peptide-focused pipelines often incorporate property-informed encodings, in which each residue is mapped to a vector of physicochemical descriptors (e.g., hydrophobicity, charge-related indices, and steric parameters). These encodings inject biochemical priors that can be helpful when data are limited or when interpretability of residue contributions is desired.220,221
In graph-based representations of peptides, commonly derived from structural formats such as SDF, MOL, MOL2, and CDX, atoms are represented as nodes and bonds as edges. The encoding typically requires building an adjacency matrix to capture atomic connectivity and a node feature matrix to describe atom-level attributes; in many cases, edge features are added as well to encode bond properties such as type, order, or aromaticity (Fig. 7, panel B).212 Neural architectures specifically designed for graph processing operate directly on these topological structures, learning embeddings that encode local chemical environments as well as global connectivity patterns.219 These representations incorporate structural constraints, stereochemical relationships, and bond-level information, making them particularly suitable for computational tasks that depend on molecular topology.
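The adjacency matrix, node feature matrix, and edge features used in graph-based encodings can be made concrete with a small hand-built example. The sketch below encodes the heavy atoms of glycine; the atom ordering, the three-element vocabulary, and the `build_graph` helper are illustrative assumptions. In practice these arrays would be derived from a structure file by a cheminformatics toolkit rather than typed by hand.

```python
from typing import Dict, List, Sequence, Tuple

# Heavy atoms of glycine in an arbitrary order: N, C-alpha, carbonyl C, O, hydroxyl O.
ATOMS = ["N", "C", "C", "O", "O"]
# Bonds as (atom index, atom index, bond order).
BONDS = [(0, 1, 1), (1, 2, 1), (2, 3, 2), (2, 4, 1)]


def build_graph(
    atoms: Sequence[str],
    bonds: Sequence[Tuple[int, int, int]],
    elements: Sequence[str] = ("C", "N", "O"),
) -> Tuple[List[List[int]], List[List[int]], Dict[Tuple[int, int], int]]:
    """Return (adjacency matrix, one-hot node features, bond-order edge features)."""
    n = len(atoms)
    adjacency = [[0] * n for _ in range(n)]
    edge_order: Dict[Tuple[int, int], int] = {}
    for i, j, order in bonds:
        # Molecular graphs are undirected, so both directions are stored.
        adjacency[i][j] = adjacency[j][i] = 1
        edge_order[(i, j)] = edge_order[(j, i)] = order
    node_features = [[1 if atom == e else 0 for e in elements] for atom in atoms]
    return adjacency, node_features, edge_order


adjacency, node_features, edge_order = build_graph(ATOMS, BONDS)
for row in adjacency:
    print(row)
```

A graph neural network would consume `adjacency` together with `node_features`, and optionally `edge_order`, to compute atom-level and molecule-level embeddings.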
For interpretability and exploratory analysis, low-dimensional projection techniques such as PCA, UMAP, or t-SNE may be applied to the higher-dimensional embeddings (latent feature space). These methods are used solely for visualization purposes, enabling qualitative assessment of similarity relationships, neighborhood structures, and clustering tendencies.222,223 They do not define the embedding itself; rather, they provide a reduced-dimensional view of the latent space. The chemically meaningful representation remains in the higher-dimensional latent space produced by the trained neural encoder.222,223
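To make the distinction between the latent space and its projection concrete, the following sketch projects a handful of embedding vectors onto their first two principal components. It is a dependency-free illustration that uses power iteration with deflation on the covariance matrix; the function names and the toy three-dimensional "embeddings" are assumptions, and a real workflow would more likely call an established PCA, UMAP, or t-SNE implementation. The sketch also assumes the data have at least two directions of nonzero variance.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def center(X: Matrix) -> Matrix:
    """Subtract the per-dimension mean from every embedding."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]


def covariance(Xc: Matrix) -> Matrix:
    """Sample covariance matrix of centered data."""
    n, d = len(Xc), len(Xc[0])
    return [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(d)] for i in range(d)]


def matvec(C: Matrix, v: List[float]) -> List[float]:
    return [sum(c * x for c, x in zip(row, v)) for row in C]


def leading_eigenvector(C: Matrix, n_iter: int = 500, seed: int = 0) -> List[float]:
    """Power iteration: repeated multiplication converges to the top eigenvector."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in C]
    for _ in range(n_iter):
        w = matvec(C, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def deflate(C: Matrix, v: List[float]) -> Matrix:
    """Remove the variance explained by eigenvector v from C."""
    eigenvalue = sum(vi * wi for vi, wi in zip(v, matvec(C, v)))
    d = len(v)
    return [[C[i][j] - eigenvalue * v[i] * v[j] for j in range(d)] for i in range(d)]


def pca_2d(X: Matrix) -> Matrix:
    """Coordinates of each embedding on the first two principal components."""
    Xc = center(X)
    C = covariance(Xc)
    pc1 = leading_eigenvector(C, seed=0)
    pc2 = leading_eigenvector(deflate(C, pc1), seed=1)
    return [[sum(a * b for a, b in zip(row, pc)) for pc in (pc1, pc2)] for row in Xc]


# Hypothetical 3-dimensional embeddings for four peptides.
embeddings = [[3.0, 0.0, 0.0], [-3.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
coords = pca_2d(embeddings)
print(coords)
```

The sign of each component is arbitrary, so only relative positions in the projected plane are meaningful; downstream models should still be trained on the full-dimensional embeddings.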
Recent computational workflows increasingly rely on embeddings from learned representations rather than other classes of descriptors, because embeddings can encode sequence context in a way that better reflects biological function and, indirectly, structural constraints.214,224 This shift was catalyzed by large protein language models (pLMs)225 such as ESM-1b207 and the Rostlab ProtTrans family (e.g., ProtBERT226 and ProtT5209), which transform a protein or peptide sequence into dense vectors that summarize informative patterns across the entire chain. In practice, these models are pretrained with self-supervised objectives, learning the likelihood of residues given their surrounding context and producing contextualized embeddings at the residue and/or sequence level.227 Because pLMs are trained on massive sequence collections such as UniRef50,228 the resulting embeddings can be reused as general-purpose features across different computational tasks.229,230 Importantly, recent studies indicate that pLM-derived embeddings can also be effective for peptides, frequently matching or outperforming traditional representations based on composition and physicochemical descriptors in predictive modeling and similarity analyses.230,231 More recently, learned representations developed specifically for peptides, including PeptideCLM229 and Multi-Peptide,232 have been used to predict peptide properties.
Quorum-sensing peptides are signaling molecules that enable communication within bacterial communities and coordinate their behavior based on population density. QSPs regulate various physiological activities, including biofilm formation and virulence factor production. These peptides play a crucial role in this communication, often functioning as autoinducers that bind to specific receptors on neighboring cells, triggering a cascade of gene expression changes.47,124 Several studies have focused on analyzing their molecular properties to build predictive models for these peptides.124,151 However, these molecules have also been shown to have selective BBBP properties48 as well as to interact with mammalian cells, selectively promoting cancer metastasis173,233 and influencing immune234 and muscle235 cells. According to Wynendaele et al. (2015), the chemical space of quorum-sensing peptides is divided into three main clusters, as indicated by principal component analysis. Peptide size and compactness dominate the first principal component. The descriptors that capture these characteristics include the radial distribution function (RDF), Burden eigenvalues (BEH, BEL), Randic shape indices, autocorrelation descriptors (ATS, GATS), weighted holistic invariant molecular (WHIM) descriptors, Balaban index, and the lopping centric index. Furthermore, the chemical space is also influenced by lipophilicity and hydrophilicity, evaluated through log P values, tPSA, and the counts of HBD and HBA, along with connectivity indices that account for peptide cyclization and descriptors related to HOMA, AROM, and ARR aromaticity, which define the second principal component. The third principal component is characterized by S-evaluating descriptors representing thiol groups, thiolactones, or disulfides, while the fourth principal component emphasizes the presence and frequency of nitrogen bonds (N–N, N–O, and N–C). As a result, peptides rich in cysteine and methionine cluster together, whereas those with basic amino acids and amides, such as asparagine, form another cluster. To investigate the brain influx and efflux properties of chemically diverse QSPs, Wynendaele et al. (2015) selected three peptides, PhrCACET1, BIP-2, and PhrANTH2, based on clustering of the PCA results. These QSPs were investigated using a multiple-time regression technique in an in vivo mouse model (ICR-CD-1) to assess blood–brain transfer characteristics. The authors found that these peptides also permeate the blood–brain barrier (BBB). PhrCACET1 exhibited a notably high initial influx into the mouse brain (Kin = 20.87 µl g−1 min−1), whereas the brain penetrabilities of BIP-2 and PhrANTH2 were determined to be low (Kin = 2.68 µl g−1 min−1) and very low (Kin = 0.18 µl g−1 min−1), respectively.48 These findings bear directly on the chemical characterization of peptide space and demonstrate an intersection between QSPs and BBB-permeating peptides that had not been characterized previously.
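The way an influx constant such as Kin is obtained from multiple-time regression data can be sketched as an ordinary least-squares fit. The sketch assumes the linear form commonly used with this technique, in which the brain-to-serum ratio (µl g−1) is regressed on exposure time (min), so that the slope estimates Kin and the intercept estimates the initial distribution volume; the original study's exact equation is not given in the text above. The data points are synthetic, generated from hypothetical parameter values, and do not correspond to any of the reported peptides.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic, noise-free data built from hypothetical values Kin = 2.0 and Vi = 10.0.
exposure_time_min = [1.0, 3.0, 5.0, 10.0, 15.0]
brain_serum_ratio = [10.0 + 2.0 * t for t in exposure_time_min]

k_in, v_i = fit_line(exposure_time_min, brain_serum_ratio)
print(f"Kin = {k_in:.2f} ul/(g*min), Vi = {v_i:.2f} ul/g")
```

With real measurements the fit would be restricted to the linear portion of the curve, and the uncertainty of the slope would be reported alongside the estimate.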
Similarly, CPPs have been shown to permeate the blood–brain barrier.47 For example, de Oliveira et al. (2021) found that CPPs possess higher MW, tPSA, and NRB values than clinically approved peptides, suggesting that their mechanisms of membrane penetration may involve processes beyond passive diffusion, such as pore formation or endocytosis. Additionally, their findings emphasize the importance of molecular flexibility and specific structural features, such as hydrogen bond patterns and the presence of aromatic rings, in influencing the permeability of these peptides, which could be related to their stereoselectivity. Regarding the B3PPs, Cavaco et al. (2024) recently identified key molecular determinants for peptides effectively crossing the BBB: a slightly hydrophobic nature, with a mean hydrophobic residue content of approximately 35%; a small size, with an average molecular weight of 2046 g mol−1; few or no aromatic residues, indicated by an average molar absorptivity of 3790 M−1 cm−1 at 280 nm, which corresponds to 1–2 tyrosine or 0–1 tryptophan residues; and a slightly cationic charge, with an average net charge of +2. The study emphasizes that not all CPPs can function as B3PPs, as the overlap between these two families is minimal. Experimental validation demonstrated that four newly identified B3PPs exhibited high translocation abilities in vitro and greater brain accumulation in vivo than established B3PPs, highlighting the importance of specific physicochemical characteristics for effective brain targeting.236
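Several of the sequence-level quantities mentioned above can be computed directly from a one-letter sequence. The sketch below is a simplified illustration under stated assumptions, not the procedure used in the cited studies: the set of residues counted as hydrophobic is a hypothetical choice, the net charge counts only Lys/Arg and Asp/Glu side chains and ignores termini and histidine, and the molar absorptivity at 280 nm uses the widely used per-residue approximation of 5500 M−1 cm−1 per tryptophan and 1490 M−1 cm−1 per tyrosine while ignoring disulfide contributions.

```python
from typing import Dict

# Hypothetical definition of "hydrophobic" residues; other definitions exist.
HYDROPHOBIC = set("AVILMFW")


def hydrophobic_fraction(sequence: str) -> float:
    """Fraction of residues belonging to the assumed hydrophobic set."""
    return sum(aa in HYDROPHOBIC for aa in sequence) / len(sequence)


def net_charge(sequence: str) -> int:
    """Simplified net charge: +1 per K/R, -1 per D/E (termini and His ignored)."""
    return sum(sequence.count(aa) for aa in "KR") - sum(sequence.count(aa) for aa in "DE")


def molar_absorptivity_280(sequence: str) -> int:
    """Approximate absorptivity at 280 nm from Trp and Tyr counts (M^-1 cm^-1)."""
    return 5500 * sequence.count("W") + 1490 * sequence.count("Y")


def describe(sequence: str) -> Dict[str, float]:
    return {
        "hydrophobic_fraction": hydrophobic_fraction(sequence),
        "net_charge": net_charge(sequence),
        "epsilon_280": molar_absorptivity_280(sequence),
    }


toy_peptide = "AKLWRY"  # hypothetical sequence, not taken from the cited studies
print(describe(toy_peptide))
```

Under this approximation, one or two tyrosines give 1490 or 2980 M−1 cm−1 and a single tryptophan gives 5500 M−1 cm−1, which brackets the average of 3790 M−1 cm−1 quoted above.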
Although the cell membrane and the BBB differ markedly in function and chemical composition, and various molecular mechanisms of permeation across these barriers have been described, most predictive models treat passive transport through the membrane as the most important mechanism.58 Furthermore, the biophysical interaction with these membranes has been explored as a key factor for permeation. It is therefore interesting to note that some molecular features usually applied to predict CPPs have also been identified as relevant for predicting B3PPs. For example, Dichiara et al. (2019) established a set of chemical descriptors to facilitate the successful prediction of BBB permeation. They statistically evaluated 328 compounds, correlating their experimental in vivo log BB values with various computed descriptors. They constructed contingency tables, calculated observed and expected distributions, and analyzed the relationships between descriptors and BBB permeation. The authors identified a significant influence of nine specific physicochemical properties on BBB permeation, including polar surface area, nitrogen and oxygen count, log P, nitrogen count, log D, oxygen count, ionization state, hydrogen bond acceptors, and hydrogen bond donors.237
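The comparison of observed and expected distributions in a contingency table can be illustrated with a short sketch. It assumes a Pearson chi-square statistic, which is a common choice for this kind of analysis but is not confirmed by the text above as the test used in the cited study. The counts are hypothetical and do not come from the 328-compound dataset.

```python
from typing import List, Sequence


def expected_counts(observed: Sequence[Sequence[float]]) -> List[List[float]]:
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi_square(observed: Sequence[Sequence[float]]) -> float:
    """Pearson chi-square statistic summed over all cells of the table."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical counts: rows = descriptor below/above a threshold,
# columns = BBB-permeant / non-permeant compounds.
table = [[30, 10], [10, 30]]
print(expected_counts(table))
print(chi_square(table))
```

A large statistic relative to the chi-square distribution with (rows − 1) × (columns − 1) degrees of freedom indicates that the descriptor and the permeation class are unlikely to be independent.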
Although the two classes represent distinct biological activities, a previous study showed that the chemically distinct CPPs pVEC, SynB3, Tat 47–57, transportan 10 (TP10), and TP10-2 differ in their ability to cross the BBB. Specifically, Tat 47–57, SynB3, and pVEC demonstrated significantly high rates of unidirectional influx, whereas the transportan variants displayed minimal to low brain penetration.47
The concept of chemical space gains particular significance when applied to peptides because their amino acid sequence intrinsically encodes and influences physicochemical and structural properties such as solubility, hydrophobicity, folding patterns, and three-dimensional conformation, which ultimately shape their biological activities. The unique characteristics of peptides, especially their conformational flexibility, further underscore the complexity of their chemical space, as these molecules can adopt distinct conformational states depending on the environment, which is often crucial for their biological function. This flexibility contributes to their ability to interact with diverse biological targets and, in some cases, to penetrate biological barriers. The choice of molecular representation strongly determines which aspects of peptide behavior become computationally accessible. Classical molecular descriptors and fingerprints remain essential for interpretable, scalable, and cost-effective analyses, particularly in virtual screening strategies. However, recent advances in machine learning have expanded this landscape by enabling embeddings derived from learned encoders, such as GNNs, LMs, and AEs, which can extract informative features directly from raw sequence or structural data. These learned representations organize peptides in latent spaces where similarity, clustering patterns, and predictive relationships become more readily exploitable, thus providing a powerful complement to conventional descriptor-based strategies.
In addition, peptide bioactivity should not be interpreted independently of conformation and context. Features such as chameleonicity, secondary-structure propensity, backbone flexibility, polarity masking, and membrane-interaction mechanisms reinforce the notion that peptide function emerges from a dynamic relationship between structure and environment. This complexity also helps explain why apparently unrelated peptide classes may partially overlap in chemical space and bioactivity, revealing intersections that are biologically meaningful and potentially useful for peptide discovery, repurposing, and design. Moreover, the overlap among distinct peptide classes with pleiotropic activities, such as QSPs, CPPs, and B3PPs, suggests shared regions of chemical space that warrant further investigation and may reveal new peptide functions as well as biotechnological and therapeutic opportunities.
This journal is © the Owner Societies 2026