Open Access Article
Luis Armando Gonzalez-Ortiz,*a Lisset Noriega,a Filiberto Ortiz-Chi,b Gabriela Vidales-Ayala,a Emmanuel Soberanis-Cáceres,a Amilcar Meneses-Viveros,*c Alan Aspuru-Guzik*d and Gabriel Merino*a
aDepartamento de Física Aplicada, Centro de Investigación y de Estudios Avanzados, Unidad Mérida, km 6 Antigua Carretera a Progreso, Apdo. Postal 73, Cordemex 97310, Mérida, Yucatán, Mexico
bSecihti-Departamento de Física Aplicada, Cinvestav-IPN, Antigua Carretera a Progreso km 6, Mérida, Yucatán 97310, Mexico
cDepartamento de Computación, Centro de Investigación y de Estudios Avanzados, Unidad Zacatenco, Av. IPN No. 2508, Apdo. Postal 07000, Col. San Pedro Zacatenco, CDMX, Mexico
dDepartment of Chemistry, University of Toronto, DB 421D, Lash Miller Chemical Laboratories, 80 St. George Street, Toronto, ON M5S 3H6, Canada. E-mail: gmerino@cinvestav.mx; luis.gonzalezo@cinvestav.mx; alan@aspuru.com; amilcar.meneses@cinvestav.mx
First published on 13th November 2025
The redundancy of SMILES notation, where multiple strings can describe the same molecule, remains a challenge in computational chemistry and cheminformatics. To mitigate this issue, we introduce TokenSMILES, a grammatical framework that standardizes SMILES into structured sentences composed of context-free words. By applying five syntactic constraints (including branch limitations, balanced parentheses, and aromaticity exclusion), TokenSMILES minimizes redundant SMILES enumerations for alkanes while maintaining valence and octet compliance through semantic parsing rules. TokenSMILES does not replace SMILES but rather formalizes its syntax into a standardized, machine-interpretable form. This grammatical structure enables controlled generation and manipulation of valid SMILES strings, ensuring syntactic and semantic consistency while substantially reducing redundancy. Implemented in SmilX, an open-source tool, TokenSMILES generates valid SMILES with accuracy comparable to existing computational implementations for molecules with low hydrogen deficiency (HDI ≤ 4). Its applicability extends beyond alkanes through stoichiometric modifications such as bond insertion, cyclization, and heteroatom substitution. Nevertheless, challenges remain for highly unsaturated systems, where canonicalization artifacts highlight the need for dynamic feasibility checks. By integrating linguistic principles with cheminformatics, TokenSMILES establishes a scalable framework for systematic chemical space exploration, supporting applications in drug discovery, materials design, and machine learning-driven molecular innovation.
Fig. 1 2-Aminomethylbenzoic acid: (a) molecular model and (b) selected SMILES strings representing the same molecule, illustrating the standardization problem addressed by TokenSMILES.
Table 1 shows syntactic variations,6 including Kekulé, aromatic, branching, and ring number/dot bond syntax, using 2-(aminomethyl)benzoic acid (C8H9NO2) as an example.
| Syntax style | Representative SMILES |
|---|---|
| Kekulé syntax | NCC1=CC=CC=C1C(=O)O |
| Aromatic syntax | NCc1ccccc1C(=O)O |
| Branching syntax | N(C(C1(=C(C(=O)O)(C(=C(C(=C1))))))) |
| Ring numbers and dot bond syntax | c13c4cccc1.C4(=O)O.N2.C23 |
Recent advances have sought to overcome these limitations by introducing alternative representations with improved syntactic control. For instance, DeepSMILES simplifies parenthesis handling by adopting postfixed ring-numbering rules,7 while t-SMILES encodes functional groups explicitly, avoiding parentheses and ring indices.8 BigSMILES extends the notation to polymers through the use of braces to denote stochastic binding patterns,8,9 whereas CurlySMILES embeds annotations within braces { } to describe noncovalent or coordinated structures, while preserving the core SMILES grammar.10
Beyond these structural adaptations, new languages such as SELFIES,11 Group SELFIES,12 and JAM13,14 have emerged. SELFIES guarantees that every token sequence corresponds to a chemically valid structure, thereby minimizing parsing and valency errors. Group SELFIES refines this idea by representing rings or functional groups as single tokens, simplifying substructure encoding. In contrast, JAM adapts SMILES-like syntax to describe stacking sequences in crystalline or layered materials, combining chemical and geometric information. Table 2 summarizes these languages using 2-(aminomethyl)benzoic acid as an example, highlighting improvements in grammatical clarity and robustness.
| Notation | Representation |
|---|---|
| SMILES | NCc1ccccc1C(=O)O |
| DeepSMILES | NCcccccc6C=O)O |
| t-SMILES | c1([1*])c([2*])cccc1^[1*]C(=O)O^[2*]CN |
| SELFIES | [N][C][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Branch1][C][=O][O] |
| Group SELFIES | [:benzene][Branch][:CH2NH2][pop][Branch][:COOH][pop] |
In this work, we introduce TokenSMILES, a grammatical and graph-theoretical framework that formalizes the SMILES language, together with SmilX, its open-source implementation. SmilX applies the TokenSMILES grammar to generate and validate molecular structures based on explicitly defined syntactic and semantic rules. Both the conceptual framework and its software implementation are presented here for the first time.
The proposed grammar provides a standardized representation for organic molecules that retains the descriptive capacity of SMILES while minimizing ambiguity. By treating molecular strings as complete sentences governed by grammatical rules rather than collections of discrete symbols, TokenSMILES enables systematic computational analysis and exhaustive isomer enumeration (Fig. 2). Implemented in SmilX, this framework departs from matrix-based approaches (e.g., MOLGEN,15 MAYGEN16) and block-based methods (e.g., SMILIB17,18), offering a more structured and linguistically consistent paradigm for molecular representation.
Fig. 2 General workflow diagram for the development of the TokenSMILES grammatical framework and the validation of the number of structures.

Fig. 3 (a) Molecular model for 2,3-dimethylbutane; (b) hydrogen-free molecular graph with labeled carbon atoms; (c) traversal pathway (orange and purple arrows) covering all atoms.
(1) Hydrogen removal. Generate a hydrogen-free molecular graph (Fig. 3b).
(2) Atom labeling. Assign unique numerical labels to non-hydrogen atoms (Fig. 3b).
(3) Path definition. Define an ordered set W representing the traversal path. In Fig. 3c, the path P = {(C3, C2), (C2, C4), (C4, C6)} corresponds to W = {C3, C2, C4, C6}.
(4) Branch identification. Identify branches Bm, where each branch contains the maximum possible number of connected atoms. The ordered set of branches is B = {B0, B1,…, Bm}. For the system in Fig. 3c, B0 = {C4, C5} and B1 = {C2, C1}.
(5) Branch insertion. Insert the branches B = {B0, B1} into W, omitting the first atom of each branch since it already appears in W. Parentheses mark the branch boundaries:
| {C3, C2, (, C1, ), C4, (, C5, ), C6} | (1) |
(6) Symbol replacement and concatenation. Replace the atomic labels in (1) with the corresponding atomic symbols and concatenate them to obtain the string CC(C)C(C)C.
The resulting SMILES string, CC(C)C(C)C, represents 2,3-dimethylbutane with implicit hydrogen atoms. Since SMILES grammar is non-commutative, the steps must be performed in the specified order.
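Steps (3) to (6) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `insert_branches` and `to_smiles` are hypothetical, the path and branches are supplied by hand rather than derived from a graph, and each branch is assumed to start with an atom that already lies on the path.

```python
def insert_branches(path, branches):
    """Step (5): place each branch, minus its first atom, after its anchor in the path."""
    tokens = []
    for atom in path:
        tokens.append(atom)
        for branch in branches:
            if branch[0] == atom:  # the branch starts at this path atom
                tokens.append("(")
                tokens.extend(branch[1:])
                tokens.append(")")
    return tokens


def to_smiles(tokens, symbols):
    """Step (6): replace atom labels by atomic symbols and concatenate."""
    return "".join(t if t in ("(", ")") else symbols[t] for t in tokens)


# 2,3-Dimethylbutane, using the labels of Fig. 3
W = ["C3", "C2", "C4", "C6"]
B = [["C4", "C5"], ["C2", "C1"]]
SYMBOLS = {label: "C" for label in ["C1", "C2", "C3", "C4", "C5", "C6"]}

tokens = insert_branches(W, B)
print(tokens)                      # the sequence shown in (1)
print(to_smiles(tokens, SYMBOLS))  # CC(C)C(C)C
```

Because the branches are inserted while walking W in order, swapping the order of the steps (for example, replacing labels before inserting branches) would lose the anchor information, which is one way to see why the procedure is order-dependent.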
This tokenization follows two sequential rules. First, the original string is parsed into individual characters, which are listed in their original order within square brackets. For CC(C)C(C)C, this yields [C, C, (, C, ), C, (, C, ), C]. Although conventional set notation implies unique elements, repeated tokens are intentionally preserved here to retain positional information.
Second, the tokens are categorized according to their syntactic context. Left-context symbols, [, (, =, and #, are placed immediately before atomic symbols, while right-context symbols, @, ), %, ], and a digit n, are placed immediately after them. Applying these rules to the example yields the standardized TokenSMILES form [C, C, (C), C, (C), C].
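The two rules can be approximated with a single regular expression. The sketch below is not the SmilX tokenizer: the name `to_token_smiles` is hypothetical, and the pattern assumes simple strings without hydrogen or charge annotations inside brackets. It groups any run of left-context symbols, one atomic symbol, and any run of right-context symbols into one word.

```python
import re

# left-context symbols, one atomic symbol, right-context symbols
WORD = re.compile(r"[\[(=#]*(?:Cl|Br|[A-Za-z])[@)\]%\d]*")


def to_token_smiles(smiles):
    """Split a SMILES string into TokenSMILES-style words, preserving order."""
    words = WORD.findall(smiles)
    if "".join(words) != smiles:
        raise ValueError(f"unsupported characters in {smiles!r}")
    return words


print(to_token_smiles("CC(C)C(C)C"))  # ['C', 'C', '(C)', 'C', '(C)', 'C']
print(to_token_smiles("CC(CC)C"))     # ['C', 'C', '(C', 'C)', 'C']
```

The second call shows the two remaining word forms, `(C` and `C)`, which appear when a branch holds more than one atom.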
| W = {a0, a1, …, an−1} or W = [a0, a1, …, an−1], |
![]() | (2) |
Here, the concatenation operator joins the strings in W, and the arrow (→) indicates the production process.
Applying Constraint 2, a0, a1 and an−1 are replaced by “C” without branch symbols, resulting in:
![]() | (3) |
![]() | (4) |
For simplicity, the inner sequence is replaced with a variable Ω, rewriting (4) as:
![]() | (5) |
To define the instances in Ω, Constraint 4 is applied. Each element in {a3, …, an−3} has four possible forms: [C|(C|C)|(C)]. Balanced parentheses are maintained through recursive rules.
The first rule, q0 → C|Cq0, generates chains such as “CC”, “CCC”, and “CCCC”. The second, q1 → (C)|(C)q1, introduces terminal branches represented by parentheses. Combinations of the two are obtained using q2 → q0|q1|q0q1|q1q0, allowing permutations of linear and branched fragments. Nested branches are introduced through q3 → (CC)|(Cq3C)|(Cq2C), which ensures balanced parentheses within multiple levels of branching. Finally, the general case is described by q4 → q2|q3|q2q3|q3q2, encompassing all possible balanced expressions in [C|(C|C)|(C)]. Substituting q4 into (5) gives:
![]() | (6) |
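The recursive rules q0 to q4 can be explored with a small bounded enumerator. The sketch below is illustrative only: `RULES` transcribes the five productions as written in the text, while `derive` and `is_balanced` are hypothetical helpers that expand the rules by fixed-point iteration up to a maximum string length and then confirm that every generated string has balanced parentheses.

```python
from itertools import product

# Productions transcribed from the text; "(C" and "C)" are branch-opening/closing words.
RULES = {
    "q0": [["C"], ["C", "q0"]],
    "q1": [["(C)"], ["(C)", "q1"]],
    "q2": [["q0"], ["q1"], ["q0", "q1"], ["q1", "q0"]],
    "q3": [["(CC)"], ["(C", "q3", "C)"], ["(C", "q2", "C)"]],
    "q4": [["q2"], ["q3"], ["q2", "q3"], ["q3", "q2"]],
}


def derive(max_len):
    """Return, for each nonterminal, all derivable strings of at most max_len characters."""
    lang = {nt: set() for nt in RULES}
    changed = True
    while changed:  # iterate until no rule adds a new string
        changed = False
        for nt, alternatives in RULES.items():
            for alt in alternatives:
                pools = [tuple(lang[s]) if s in RULES else (s,) for s in alt]
                for parts in product(*pools):
                    s = "".join(parts)
                    if len(s) <= max_len and s not in lang[nt]:
                        lang[nt].add(s)
                        changed = True
    return lang


def is_balanced(s):
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


lang = derive(max_len=8)
print(sorted(lang["q3"], key=len)[:3])
print(all(is_balanced(s) for s in lang["q4"]))  # True
```

The length bound is what makes the enumeration terminate; the grammar itself is unbounded.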
Using (6), the number of SMILES representations for CnH2n+2 isomers decreases drastically. For example, Fig. 4 shows the construction of SMILES for C6H14 using (6) with [q4] = [C|(C)]:
![]() | (7) |
Fig. 4 (a) A generator tree for the SMILES strings of the C6H14 system and (b) the progress of the production rule (7) to obtain the atomic symbols in the SMILES.
However, (7) does not account for atomic equivalence or valence restrictions, which may result in redundant strings (e.g., [C, C, (C), C, C, C] and [C, C, C, C, (C), C], which describe the same molecule) or chemically invalid strings (e.g., [C, C, (C), (C), (C), C]). To address these limitations, the next section introduces the semantic parsing of TokenSMILES, which filters out chemically inconsistent structures and enforces the octet rule.
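The candidate set for C6H14 can be reproduced with a Cartesian product. This sketch assumes, based on the candidates listed in the text, that rule (7) fixes the first two words and the last word as C and lets each of the three inner words be C or (C); the function name `c6h14_candidates` is hypothetical.

```python
from itertools import product


def c6h14_candidates(inner_slots=3):
    """Enumerate six-word TokenSMILES with [q4] = [C|(C)] in the inner positions."""
    choices = ["C", "(C)"]
    return [["C", "C", *inner, "C"] for inner in product(choices, repeat=inner_slots)]


candidates = c6h14_candidates()
for words in candidates:
    print(words, "->", "".join(words))
print(len(candidates))  # 8
```

Every candidate has exactly six words, including the invalid [C, C, (C), (C), (C), C], which is why a separate semantic check is still needed.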
| Context(α) = (p0, p1, p2) |
As an example, the TokenSMILES [C, C, (C), C, (C), C] can be indexed as [C0, C1, (C2), C3, (C4), C5]. The chemical context for each substring is summarized in Table 3, where the first element identifies the atomic symbol (“C”), the second indicates the valence (4), and the third lists the corresponding connectivity set.
| String | Chemical context |
|---|---|
| C0 | (C, 4, {(0,1)}) |
| C1 | (C, 4, {(0,1), (1,2), (1,3)}) |
| (C2) | (C, 4, {(1,2)}) |
| C3 | (C, 4, {(1,3), (3,4), (3,5)}) |
| (C4) | (C, 4, {(3,4)}) |
| C5 | (C, 4, {(3,5)}) |
Fig. 5 Procedure for extracting bonds from the TokenSMILES of 2,3-dimethylbutane. (a)–(f) Depict the sequential steps of the method.
(a) “C0” is not concatenated with any other token, so its connectivity set remains empty.
(b) “C1” is concatenated with “C0”, creating a new bond represented by the tuple (0, 1).
(c) The same procedure is applied to the remaining tokens. Fig. 5c shows how (C2) is concatenated, using parentheses to indicate the start and end of a branch emerging from “C1”, adding the tuple (1, 2).
(d) “C3” is then concatenated, forming the tuple (1, 3) rather than (2, 3), since “(C2)” is a closed branch.
(e) “(C4)” is concatenated next, adding the tuple (3, 4).
(f) Finally, “C5” is concatenated, producing the tuple (3, 5).
After processing all tokens, the resulting bond set is {(0, 1), (1, 2), (1, 3), (3, 4), (3, 5)}, corresponding to [C, C, (C), C, (C), C]. Each tuple represents the connectivity in the TokenSMILES notation.
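Steps (a) to (f) amount to a single left-to-right pass with a stack that remembers where each branch started. The following is a minimal sketch, not the SmilX code: `extract_bonds` and `chemical_contexts` are hypothetical names, only single bonds and acyclic strings are handled, and the valence table is limited to carbon.

```python
VALENCE = {"C": 4}  # assumption: only carbon is needed for the alkane example


def extract_bonds(words):
    """Return the bond tuples implied by concatenating TokenSMILES words in order."""
    bonds, stack, prev = [], [], None
    for i, word in enumerate(words):
        stack.extend([prev] * word.count("("))  # "(" opens a branch at the previous atom
        if prev is not None:
            bonds.append((prev, i))             # bond to the atom this word attaches to
        prev = i
        for _ in range(word.count(")")):        # ")" closes a branch: return to its origin
            prev = stack.pop()
    return bonds


def chemical_contexts(words):
    """Build (symbol, valence, connectivity set) for each word, as in Table 3."""
    bonds = extract_bonds(words)
    contexts = []
    for i, word in enumerate(words):
        symbol = "".join(ch for ch in word if ch.isalpha())
        contexts.append((symbol, VALENCE[symbol], {b for b in bonds if i in b}))
    return contexts


words = ["C", "C", "(C)", "C", "(C)", "C"]
print(extract_bonds(words))  # [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5)]
for word, context in zip(words, chemical_contexts(words)):
    print(word, context)
```

The stack is what makes "C3" bond to position 1 rather than position 2: the closing parenthesis of "(C2)" restores the branch origin before the next word is attached.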
As shown in Fig. 5a, the bond contexts initially consist of empty sets. As the strings are concatenated according to the SMILES grammar and defined constraints, their contexts expand as new bonds are introduced. From this stage onward, the strings listed in Table 3 are treated as context-free strings or simply words.22 These words lack explicit connectivity to other atoms, excluding hydrogen. Consequently, a TokenSMILES can be interpreted as a sequence of such words forming a chemically meaningful sentence.
To illustrate this process, we return to the example of the C6H14 isomers. Using production rule (7), connectivity was assigned to each TokenSMILES string, and those that violated the octet rule were removed, yielding seven valid SMILES candidates (Fig. 6). To eliminate duplicates, each TokenSMILES was converted to a canonical SMILES form using the canonicalization algorithm implemented in the RDKit module.24 Strings sharing the same canonical form were identified, and only unique SMILES were retained. After filtering, five distinct SMILES remained, corresponding precisely to the five constitutional isomers of C6H14.
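The 8 → 7 → 5 reduction can be checked without external dependencies. The sketch below makes two substitutions that should be read as assumptions: the candidate list is generated by letting the three inner words be C or (C), and duplicate removal uses a simple canonical encoding for unlabeled trees in place of RDKit's canonical SMILES. That replacement is valid only for acyclic, all-carbon skeletons such as alkanes. All function names are hypothetical.

```python
from itertools import product


def extract_bonds(words):
    """Single-pass bond extraction with a branch stack (single bonds, no rings)."""
    bonds, stack, prev = [], [], None
    for i, word in enumerate(words):
        stack.extend([prev] * word.count("("))
        if prev is not None:
            bonds.append((prev, i))
        prev = i
        for _ in range(word.count(")")):
            prev = stack.pop()
    return bonds


def satisfies_octet(words, max_degree=4):
    """Reject strings in which any carbon carries more than four bonds."""
    degree = [0] * len(words)
    for i, j in extract_bonds(words):
        degree[i] += 1
        degree[j] += 1
    return max(degree) <= max_degree


def tree_key(words):
    """Canonical key for an unlabeled tree: smallest rooted encoding over all roots."""
    adjacency = {i: [] for i in range(len(words))}
    for i, j in extract_bonds(words):
        adjacency[i].append(j)
        adjacency[j].append(i)

    def encode(node, parent):
        children = sorted(encode(c, node) for c in adjacency[node] if c != parent)
        return "(" + "".join(children) + ")"

    return min(encode(root, None) for root in adjacency)


candidates = [["C", "C", *inner, "C"] for inner in product(["C", "(C)"], repeat=3)]
valid = [w for w in candidates if satisfies_octet(w)]
unique = {}
for w in valid:
    unique.setdefault(tree_key(w), "".join(w))  # keep the first string per skeleton

print(len(candidates), len(valid), len(unique))  # 8 7 5
print(sorted(unique.values()))
```

For systems with rings, multiple bonds, or heteroatoms the tree encoding no longer applies, and a general canonicalizer such as the one the authors use would be needed.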
Fig. 6 Production of valid SMILES for the C6H14 isomers. Unlike Fig. 4, the invalid TokenSMILES [C, C, (C), (C), (C), C] is omitted here because it does not satisfy the octet rule.
Integrating grammatical constraints into the Kekulé syntax transforms variable-length isomer strings into standardized, fixed-length representations, reducing redundancy and ensuring syntactic coherence. For example, classical SMILES enumeration of C6H14 yields 125 valid strings, whereas TokenSMILES generates only seven normalized candidates: [C, C, C, C, C, C], [C, C, C, C, (C), C], [C, C, C, (C), C, C], [C, C, C, (C), (C), C], [C, C, (C), C, C, C], [C, C, (C), C, (C), C], and [C, C, (C), (C), C, C], each conforming to a consistent six-word structure (Table 4).
| TokenSMILES | SMILES |
|---|---|
| String lengths normalized to 6 words | String length varies from 6 to 14 symbols |
| [C,C,C,C,C,C] | CCCCCC, CCCCC(C), CCCC(CC), CCC(CCC), C(CC)CCC, C(CC)(CCC), C(CC)CC(C), C(CC)C(CC), C(CCC)(CC), C(CCC)CC, C(CCC)C(C), C(CCCC)C, C(CCCC)(C), C(CCCCC), C(C)(CCCC), C(C)CCC(C), C(C)CC(CC), C(C)CCCC, CC(CCCC), C(C)C(CCC) |
| [C,C,C,C,(C),C] [C,C,(C),C,C,C] | CCCC(C)C, CCCC(C)(C), CCC(C(C)C), CC(C)CCC, CC(C)CC(C), C(CC)(C(C)C), C(C(C)C)(CC), C(C(CCC)C), C(C(C)CCC), C(C(C)C)C(C), C(C(C)C)CC, C(CCC(C)C), C(CCC)(C)C, C(CCC)(C)(C), C(CC(C)C)C, C(CC(C)C)(C), C(C)(CCC)(C), C(C)(CCC)C, C(C)(C)(CCC), C(C)(C)C(CC), C(C)CC(C)(C), C(C)(C)CC(C), C(C)C(C(C)C), C(C)CC(C)C, C(C)(C)CCC, CC(CCC)C, CC(CC(C)C), CC(CCC)(C), CC(C)(CCC), CC(C)C(CC), C(CC)C(C)C, C(C)(CC(C)C) |
| [C,C,C,(C),C,C] | CCC(C)CC, CCC(C)C(C), CCC(C)(CC), CCC(CC)C, CCC(CC)(C), C(CC)(C)CC, C(CC)(CC)C, C(CC)(CC)(C), C(C(CC)CC), C(CC)(C)(CC), C(C(CC)C)C, C(C(C)CC)C, C(C(C)CC)(C), C(CC(C)CC), C(CC(CC)C), C(C)(CC)(CC), C(C)(C(CC)C), C(C)(CC)CC, C(C)(CC)C(C), C(C)(C(C)CC), CC(CC)CC, CC(CC)C(C), C(C)C(C)(CC), C(C)C(C)CC, C(C)C(C)C(C), C(C)C(CC)C, C(C)C(CC)(C), CC(C(C)CC), CC(C(CC)C), CC(CC)(CC), CC(C(C)(C)C), C(CC)(C)C(C) |
| [C,C,(C),C,(C),C] | C(C(C(C)C)C), C(C(C)C)(C)C, C(C(C)C)(C)(C), C(C(C)C(C)C), C(C)(C(C)C)(C), C(C)(C(C)C)C, C(C)(C)(C(C)C), C(C)(C)C(C)(C), C(C)(C)C(C)C, CC(C(C)C)(C), CC(C(C)C)C, CC(C)C(C)C, CC(C)(C(C)C), CC(C)C(C)(C) |
| [C,C,(C),(C),C,C] [C,C,C,(C),(C),C] | CCC(C)(C)C, CCC(C)(C)(C), C(CC)(C)(C)C, C(C(CC)(C)C), C(CC)(C)(C)(C), C(C(C)(C)CC), C(C(C)(C)C)C, CC(C)(C)C(C), C(C(C)(CC)C), C(CC(C)(C)C), C(C(C)(C)C)(C), C(C)(CC)(C)(C), C(C)(CC)(C)C, C(C)(C(C)(C)C), C(C)(C)(CC)(C), CC(C)(C)CC, C(C)(C)(C)(CC), C(C)(C)(CC)C, C(C)(C)(C)CC, C(C)(C)(C)C(C), CC(CC)(C)C, C(C)C(C)(C)C, C(C)C(C)(C)(C), CC(CC)(C)(C), CC(C)(CC)C, CC(C)(CC)(C), CC(C)(C)(CC) |
The modification of TokenSMILES stoichiometry is achieved through the analysis of nesting levels and adjacency sets, which describe the implicit connectivities within a string. Each word in a TokenSMILES carries a grammar-based index that defines its nesting depth and adjacency relations. By editing these relations while preserving valence consistency, the algorithm modifies molecular topology in a rule-based manner. This approach allows direct structural transformations, such as the insertion of rings or double bonds, from the TokenSMILES representation without relying on IUPAC nomenclature or predefined templates (see SI for details).
To illustrate the process, we begin with 2,3-dimethylbutane (C6H14) and modify it to generate 1-cyclopropylideneethanol (C5H8O, Fig. 7). First, the connectivity set is extracted as {(0, 1), (1, 2), (1, 3), (3, 4), (3, 5)}. Using these bonds, the algorithm identifies adjacent words representing bonds in TokenSMILES. Words at positions 1 and 3 (Fig. 7a) are selected for double-bond insertion. To evaluate feasibility, the valence excess (Δ) is computed for each atom, defined as the difference between its valence and its degree (the number of incident bonds). When Δ ≥ 1, a “=” symbol is inserted into the word at the higher position in the sentence (Fig. 7b).
To prevent excessive “ = ” insertions, the hydrogen deficiency index (HDI) of the initial system (C6H14) is used as a control variable. Each additional double bond increases the HDI by one, ensuring that the connectivity modifications remain chemically valid and stoichiometrically consistent.
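A minimal sketch of the double-bond step follows. It is not the SmilX implementation: the function names are hypothetical, the connectivity set is entered by hand as a dictionary of bond orders, and the degree is taken as the sum of bond orders so that an existing double bond consumes two valence units (an assumption that goes slightly beyond the definition in the text). The HDI cap is passed in as `max_hdi`.

```python
VALENCE = {"C": 4}


def valence_excess(words, bond_orders):
    """Delta = valence minus the sum of bond orders at each position."""
    excess = []
    for i, word in enumerate(words):
        symbol = "".join(ch for ch in word if ch.isalpha())
        used = sum(order for pair, order in bond_orders.items() if i in pair)
        excess.append(VALENCE[symbol] - used)
    return excess


def insert_double_bond(words, bond_orders, i, j, hdi, max_hdi):
    """Upgrade the bond (i, j) to a double bond by writing '=' into the later word."""
    pair = (min(i, j), max(i, j))
    if pair not in bond_orders:
        raise ValueError("positions are not bonded")
    if hdi >= max_hdi:
        raise ValueError("HDI limit reached")
    delta = valence_excess(words, bond_orders)
    if delta[i] < 1 or delta[j] < 1:
        raise ValueError("no valence excess available")
    target = pair[1]
    word = words[target]
    k = next(idx for idx, ch in enumerate(word) if ch.isalpha())  # first atomic character
    new_words = list(words)
    new_words[target] = word[:k] + "=" + word[k:]  # '=' is a left-context symbol
    new_orders = dict(bond_orders)
    new_orders[pair] = 2
    return new_words, new_orders, hdi + 1


words = ["C", "C", "(C)", "C", "(C)", "C"]
orders = {(0, 1): 1, (1, 2): 1, (1, 3): 1, (3, 4): 1, (3, 5): 1}
new_words, new_orders, hdi = insert_double_bond(words, orders, 1, 3, hdi=0, max_hdi=1)
print("".join(new_words), hdi)  # CC(C)=C(C)C 1
```

Writing "=" into the word at the higher position works because, in the extracted connectivity, that word is attached to the lower position, so the symbol modifies exactly the selected bond.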
Next, to insert a cycle, two non-adjacent words are selected based on the same connectivity {(0, 1), (1, 2), (1, 3), (3, 4), (3, 5)}. If no bond exists between the chosen positions, the algorithm proceeds with cycle insertion. For the example in Fig. 7b, words at positions 0 and 2 are chosen, and the absence of bond (0, 2) is confirmed. The valence excess for both atoms is then evaluated; if Δ > 0 for each, cycle symbols are inserted (Fig. 7c). A random ring number between 1 and 9 is assigned and placed to the right of the atomic symbol (e.g., “C1”). For numbers 10–99, the cycle symbol is prefixed with “%” (e.g., “C%10”). Each cycle insertion increases the HDI by one (Fig. 7c).
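The cycle step can be sketched in the same style. As before, this is an illustration with hypothetical names rather than the SmilX code. The ring number is passed in by the caller instead of being drawn at random, and the caller is assumed to choose a number that is not already open in the string. The starting point is the state after the double-bond insertion of Fig. 7b.

```python
VALENCE = {"C": 4}


def valence_excess(words, bond_orders):
    """Delta = valence minus the sum of bond orders at each position."""
    excess = []
    for i, word in enumerate(words):
        symbol = "".join(ch for ch in word if ch.isalpha())
        used = sum(order for pair, order in bond_orders.items() if i in pair)
        excess.append(VALENCE[symbol] - used)
    return excess


def ring_label(number):
    """Ring numbers 1-9 are written as digits; 10-99 take a '%' prefix."""
    if not 1 <= number <= 99:
        raise ValueError("ring number must be between 1 and 99")
    return str(number) if number < 10 else f"%{number}"


def insert_cycle(words, bond_orders, i, j, ring_number):
    """Close a ring between two non-bonded positions that both have valence excess."""
    pair = (min(i, j), max(i, j))
    if i == j or pair in bond_orders:
        raise ValueError("positions must be distinct and not already bonded")
    delta = valence_excess(words, bond_orders)
    if delta[i] <= 0 or delta[j] <= 0:
        raise ValueError("no valence excess available")
    label = ring_label(ring_number)
    new_words = list(words)
    for pos in pair:
        word = new_words[pos]
        k = max(idx for idx, ch in enumerate(word) if ch.isalpha())  # last atomic character
        new_words[pos] = word[: k + 1] + label + word[k + 1:]        # right of the symbol
    new_orders = dict(bond_orders)
    new_orders[pair] = 1
    return new_words, new_orders


words = ["C", "C", "(C)", "=C", "(C)", "C"]
orders = {(0, 1): 1, (1, 2): 1, (1, 3): 2, (3, 4): 1, (3, 5): 1}
new_words, new_orders = insert_cycle(words, orders, 0, 2, ring_number=1)
print("".join(new_words))  # C1C(C1)=C(C)C
```

Placing the label directly after the atomic symbol keeps it inside any closing parenthesis, so "(C)" becomes "(C1)" rather than "(C)1".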
Finally, to substitute a carbon atom with oxygen, the condition degree(C) ≤ valence(O) must be satisfied before replacement. The operation reduces the carbon and hydrogen counts by one and two, respectively (Fig. 7d). The allowed heteroatoms are B, Br, Cl, F, I, N, O, P, and S. This approach generates all structural isomers of C5H8O (Table SI1). Once all transformations are complete, equivalent SMILES are filtered using the canonicalization algorithm in RDKit.
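The substitution test can be written as a short guard followed by a string replacement. In this sketch the valence table is an assumption (the lowest common valence of each allowed heteroatom), the degree is taken as the sum of bond orders, and `substitute_heteroatom` and `hydrogen_change` are hypothetical names. The input is the ring-closed, unsaturated skeleton C1C(C1)=C(C)C with its connectivity entered by hand.

```python
# Assumed valences for carbon and the allowed heteroatoms (not taken from the paper)
VALENCE = {"C": 4, "B": 3, "N": 3, "P": 3, "O": 2, "S": 2,
           "F": 1, "Cl": 1, "Br": 1, "I": 1}


def substitute_heteroatom(words, bond_orders, pos, hetero):
    """Replace the carbon at `pos` if its degree does not exceed the heteroatom valence."""
    word = words[pos]
    symbol = "".join(ch for ch in word if ch.isalpha())
    if symbol != "C":
        raise ValueError("only carbon positions can be substituted")
    degree = sum(order for pair, order in bond_orders.items() if pos in pair)
    if degree > VALENCE[hetero]:
        raise ValueError(f"degree {degree} exceeds valence of {hetero}")
    new_words = list(words)
    new_words[pos] = word.replace("C", hetero, 1)
    return new_words


def hydrogen_change(hetero):
    """Change in hydrogen count when one saturated carbon site becomes `hetero`."""
    return VALENCE[hetero] - VALENCE["C"]


words = ["C1", "C", "(C1)", "=C", "(C)", "C"]
orders = {(0, 1): 1, (1, 2): 1, (0, 2): 1, (1, 3): 2, (3, 4): 1, (3, 5): 1}
print("".join(substitute_heteroatom(words, orders, 5, "O")))  # C1C(C1)=C(C)O
print(hydrogen_change("O"))                                   # -2
```

The result corresponds to the C5H8O target of the worked example: one carbon is lost and, because oxygen offers two fewer bonding positions than carbon, two hydrogens are lost with it.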
The workflow begins with a molecular formula provided as a string, such as CnH2n+2. Based on this input, SmilX generates all possible TokenSMILES corresponding to the defined stoichiometry and adjusts their syntax accordingly. The resulting words in each TokenSMILES are concatenated to form complete SMILES strings, which are then processed using a canonicalization algorithm to eliminate duplicates and retain unique structures. Finally, the software uses the RDKit module to generate stereoisomeric variants, returning a curated list of SMILES strings that satisfy the input molecular formula.
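The first step of this workflow, reading the molecular formula and deriving its hydrogen deficiency index, can be sketched as follows. The function names are hypothetical and the parser accepts only flat formulas such as C6H14 or C5H8O (no parentheses or charges). The HDI expression is the standard textbook one, (2C + 2 + N − H − X)/2, with B and P counted like N as trivalent atoms, which is an assumption about how SmilX treats them.

```python
import re

TRIVALENT = ("N", "P", "B")
HALOGENS = ("F", "Cl", "Br", "I")


def parse_formula(formula):
    """Turn a flat molecular formula into a dictionary of element counts."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


def hydrogen_deficiency_index(counts):
    """HDI = (2C + 2 + N - H - X) / 2; divalent atoms such as O and S do not contribute."""
    carbons = counts.get("C", 0)
    hydrogens = counts.get("H", 0)
    trivalent = sum(counts.get(s, 0) for s in TRIVALENT)
    halogens = sum(counts.get(s, 0) for s in HALOGENS)
    twice = 2 * carbons + 2 + trivalent - hydrogens - halogens
    if twice < 0 or twice % 2:
        raise ValueError("formula does not correspond to a closed-shell structure")
    return twice // 2


for formula in ["C6H14", "C5H8O", "C8H9NO2", "C8H4"]:
    print(formula, hydrogen_deficiency_index(parse_formula(formula)))
```

For the worked example, C6H14 has an HDI of 0 and C5H8O an HDI of 2, matching one double-bond insertion plus one cycle insertion.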
SmilX reproduced Elyashberg's isomer counts for most systems (Fig. 8). Minor deviations occurred when the HDI approached the total heavy-atom count: SmilX occasionally yielded one missing structure (false negative, e.g., C8H4) or 1–7 additional isomers (false positives) in systems such as C8H2, C9H2, C10H2, C9H4, C10H4, C10H6, C9H8, and C10H10 (Fig. 8).
MOLGEN accurately enumerated isomers at low HDI values but showed decreased accuracy as HDI increased (Fig. 8). In contrast, MAYGEN's online version encountered memory saturation in larger systems (C9H6, C9H8, C9H10, C10H2, C10H4, C10H6, C10H8, C10H10, C10H12, C10H14, and C10H16) but reproduced Elyashberg's data for smaller, computationally feasible cases.
MAYGEN employs a canonicalization procedure distinct from RDKit's implementation of the Weininger algorithm.16 Consequently, canonical SMILES can struggle to distinguish equivalent atoms in molecules containing multiple nested rings, particularly when HDI is equal to or greater than half the number of heavy atoms. SmilX's reliance on RDKit likely accounts for the few observed false positives and negatives.
SmilX showed efficient performance, aided by disk-caching optimization. All three tools produced identical isomer counts for HDI ≤ 4, with small discrepancies emerging at higher HDI values. These deviations further support the hypothesis that RDKit's canonicalization algorithm faces difficulties in handling highly nested ring topologies.
The TokenSMILES framework effectively reduced both string redundancy and computational overhead through grammatical constraints and caching. While SmilX correctly enumerated the majority of organic systems, boundary cases where HDI approached the heavy-atom count remained problematic. These discrepancies are consistent with the theoretical limitations of canonicalization algorithms in systems containing high symmetry or multiple fused rings, rather than with specific software errors. Despite these edge-case issues, SmilX maintained low execution times, and the reuse of cached structures enabled scalable exploration of extensive chemical spaces.
Classical SMILES representations provide a foundation for cheminformatics but suffer from significant redundancy, as shown in Table 4. To overcome this limitation, the present work redefines SMILES not merely as atomic sequences but as grammatically structured sentences, a conceptual framework rooted in formal language theory. The TokenSMILES approach implements this through hierarchical syntax (word- and sentence-level organization), enforced grammatical constraints, and standardized string lengths. This reformulation reduces redundancy, ensures systematic chemical-space coverage, and facilitates new computational applications.
Unlike conventional structure generators such as MOLGEN or MAYGEN, TokenSMILES emphasizes formal representation rather than speed or memory efficiency, which justifies the omission of runtime benchmarks. Similarly, unlike cheminformatics toolkits such as RDKit, it augments rather than replaces SMILES syntax by operating on grammatical constructs. This strategy enables dynamic programming, partial-solution reuse, and exploration beyond the limitations of purely graph-based methods, while remaining compatible with evolutionary and machine-learning algorithms.
Beyond alkanes, TokenSMILES enables stoichiometric modifications such as bond insertion, cyclization, and heteroatom substitution, extending its applicability to broader organic systems. In high-HDI cases, minor misidentifications arise from RDKit's canonicalization limitations, suggesting the need for improved feasibility checks.
Currently, TokenSMILES prioritizes grammatical completeness and semantic accuracy over computational efficiency. Although benchmarking was not the focus of this work, future versions could adopt optimization strategies inspired by algebraic isomer generators. The framework is inherently compatible with machine learning due to its discrete syntax, fixed-length representations, and reusable grammatical components, which enable hybrid symbolic–neural modeling and grammatical evolution.
Treating SMILES as grammatically structured sentences introduces a new paradigm for cheminformatics, linking linguistic theory with chemical representation. This approach supports machine-learning-based molecular design and systematic chemical-space mapping. Future extensions to polymers, organometallics, and crystalline systems may open new applications in materials and drug discovery.
This journal is © The Royal Society of Chemistry 2025