Group SELFIES: A Robust Fragment-Based Molecular String Representation

We introduce Group SELFIES, a molecular string representation that leverages group tokens to represent functional groups or entire substructures while maintaining chemical robustness guarantees. Molecular string representations, such as SMILES and SELFIES, serve as the basis for molecular generation and optimization in chemical language models, deep generative models, and evolutionary methods. While SMILES and SELFIES leverage atomic representations, Group SELFIES builds on top of the chemical robustness guarantees of SELFIES by enabling group tokens, thereby creating additional flexibility to the representation. Moreover, the group tokens in Group SELFIES can take advantage of inductive biases of molecular fragments that capture meaningful chemical motifs. The advantages of capturing chemical motifs and flexibility are demonstrated in our experiments, which show that Group SELFIES improves distribution learning of common molecular datasets. Further experiments also show that random sampling of Group SELFIES strings improves the quality of generated molecules compared to regular SELFIES strings. Our open-source implementation of Group SELFIES is available online, which we hope will aid future research in molecular generation and optimization.


Introduction
The discovery of functional molecules for drugs and energy materials is crucial to tackling global challenges in public health and climate change.Different types of generative models can suggest potential molecules to synthesize and test, but the performance of the models and molecules heavily relies on the underlying molecular representation.Several models generate molecules represented as SMILES strings [1][2][3][4][5], but their generated output can be invalid due to syntax errors or incorrect valency.SELFIES [6] is a molecular string representation that overcomes chemical invalidity challenges by ensuring that any string of SELFIES characters can be decoded to a molecule with valid valency.This not only makes it a natural representation for chemical language models that output molecular strings, but also for genetic algorithms such as GA+D, STONED, and JANUS [7][8][9] for molecular optimization.SELFIES improves string-based molecular generation by encoding prior knowledge of valency constraints into the representation independently of the optimization method.The representation has been shown to improve distribution learning by language models [10], as well as image2string and
string2string translation [11,12] and molecular generation in data-sparse regimes [13].Additionally, simple add/replace/delete edits to SELFIES strings can generate new but similar molecules, enabling genetic algorithms that directly manipulate strings to generate molecules [7].Alternatively, guiding these simple string edits with Tanimoto similarity can interpolate between molecules as performed in STONED by Nigam et al. [8], which can then be applied as crossover operations in genetic algorithms such as JANUS [9].Molecular interpolation has also been used to find counterfactual decision boundaries that explain a molecular classifier's decisions [14].
While SMILES and SELFIES represent molecules at the individual atom and bond level, human chemists typically think about molecules in terms of the substructures that they contain.Human chemists can distinguish molecular substructures based solely on the image of a molecule and induce the molecular properties those substructures usually imply.Many fragment-based generative models take advantage of this inductive bias [15][16][17][18][19][20][21][22][23][24] by constructing custom representations amenable to fragment-based molecular design.However, these approaches are not string-based, thereby losing desirable properties of string representations: easy manipulation, and direct input into established language models.Similar to how SELFIES incorporates prior knowledge of valency constraints, we can also incorporate prior knowledge in the form of functional groups and molecular substructures into the representation.In this work, we combine the flexibility of string representations with the chemical robustness of SELFIES and the interpretability and inductive bias of fragment-based approaches into a novel string representation: Group SELFIES, a robust string representation that extends SELFIES to include tokens which represent functional groups or entire substructures.
In Section 2, we discuss how Group SELFIES fits into related research and then formally introduce the representation in Section 3. Specifically, we outline how molecules are encoded into and decoded from Group SELFIES, and we show that arbitrary Group SELFIES strings can be decoded to molecules with valid valency.The representation enables users to easily specify their own groups or extract fragments from a dataset, leveraging the wide area of cheminformatics research available there.In Section 4, we find that Group SELFIES is more compact than SMILES or SELFIES and improves distribution learning.Additionally, we compare molecules generated via randomly sampling SELFIES and Group SELFIES strings and find that Group SELFIES improves the quality of generated molecules.Molecular generation via random sampling provides greater emphasis on the representation itself by abstracting away the complexities of the type of generative method used, which we leave to future work as described in Section 5.
Overall, Group SELFIES provides the flexibility of group representation, the ability to represent extended chirality via chiral group tokens and chemical robustness as summarized in    [30] allows for "macro atoms" which specify multiple atoms in a substructure.The Hierarchical Editing Language for Macromolecules (HELM) [31] represents complex biomolecules by declaring monomers and then connecting them in a polymer line notation.Human-Readable SMILES [32] applies common abbreviations for chemical substituents to process and compress SMILES strings into a more human-readable format.SMILES Pair Encoding [33] breaks down SMILES strings by tokenizing them in a data-driven way that recognizes common substructures.
Genetic Programming: A string representation of molecules such as SELFIES can be thought of as a programming language where programs specify how to construct molecules.Genetic programming [34] uses genetic algorithms to design programs that fulfill desired constraints.In particular for linear genetic programming [35], programs are represented as linear sequences of atomic instructions, rather than as a tree of expressions.Linear sequences of atomic instructions are easily mutated by changing any instruction in the sequence, since any sequence of atomic instructions will still be a runnable program.In this way, SELFIES and Group SELFIES can be thought of as domain-specific languages for linear genetic programming.
Learned Grammars: Data-Efficient Graph Grammar Learning (DEG) [22] is a recent approach for extracting useful formal graph grammars from small datasets of molecules.In this context, a "useful grammar" means that molecules generated by applying random applicable production rules usually have high scores.The learned production rules of the graph grammar can be thought of as similar to functional groups applied in Group SELFIES.Group SELFIES allows for flexibility and fine control of substructures, which can extracted from any procedure including DEG.

SELFIES Framework
Before introducing in Group SELFIES in greater detail, we summarize the primary features of SELFIES and the reasons underlying its chemical robustness.SELFIES is equipped with an encoder and a decoder.The encoder takes in a molecule and converts it to a SELFIES string, and the decoder takes in a SELFIES string and converts it to a molecule.To encode a molecule in SELFIES, one traverses its molecular graph and outputs the processed traversal as a string of SELFIES tokens.
To decode a SELFIES string, one reads through the string token-by-token, building the molecular graph along the way until arriving at the finished graph.Since the encoding and decoding process alone does not guarantee chemical robustness, the SELFIES decoder further includes two important features: 1.Each token in SELFIES is overloaded to ensure that it can be interpreted sensibly in all contexts.For instance, all tokens in SELFIES can also be interpreted as numbers, which is useful when expressing branch and ring lengths.
2. SELFIES keeps track of the available valency at each step in the decoding process; if a bond would be formed that would exceed this valency, it changes the bond order or ignores the bond.For instance, when decoding By preserving these properties in the Group SELFIES decoder, we ensure chemical robustness is preserved.

Basic Tokens in Group SELFIES
Group SELFIES strings consist of the following fundamental tokens: • [X] adds an atom with the atomic symbol X.
• [Branch] creates a new branch off the current atom and saves the current atom as a branchpoint to return to later, and is analogous to an opening parenthesis ( in SMILES. [pop] exits the current branch, returning to the most recent branchpoint, and is analogous to a closing parenthesis ) in SMILES.Unlike in SMILES, however, [Branch] and [pop] tokens need not come in pairs, which helps maintain robustness.This token is also different from the [BranchX] tokens in SELFIES.Experiments in Appendix A.5 indicate this change does not substantially affect the performance of Group SELFIES.
• [RingX] indicates that a ring bond will be formed from the current atom.The next X tokens immediately following [RingX] will be interpreted as a number N , and we will count backwards N atoms in placement order to determine the target of the ring bond.For example, [Ring2] indicates that the next 2 tokens will be interpreted as a 2-digit base-16 number N .Ring bonds are stored until after all tokens have been read by the decoder; only then are ring bonds placed, and only if it would not violate valency.Due to the addition of groups, it is sometimes necessary to form ring bonds to atoms that are added after the current atom (e.g.ring bonds within groups).
To indicate this, we use the [->] token before the number token to specify that we will count forwards instead of backwards.
All tokens can be modified by adding =, #, \ or / to change the bond order or stereochemistry of their parent bond (e.g.[#Branch] or [/C]).The parent bond is the bond to the previous atom.

Groups
The primary difference between SELFIES and Group SELFIES is the addition of groups.Each group is defined as a set of atoms and bonds representing the molecular group with its attachment points, indicating how the group can participate in bonding.Each attachment point has a specified maximum valency, which allows us to continue tracking available valency while decoding.These attachment points are labeled by attachment indices, and the encoder and decoder will navigate around these attachment indices as described in Section 3.4.
Users must specify the groups they want to use using a dictionary that maps group names to groups.This tells the encoder what groups to recognize, and tells the decoder how to map group tokens to groups.We call this dictionary a "group set", and every group set defines its own distinct instance of Group SELFIES.In particular, the decoder will not recognize a Group SELFIES string that contains group tokens not present in the current group set.
To distinguish group tokens from other tokens, we include a : character at the front of the token (e.g.[:4benzene]).All group tokens are of the form [:S<group-name>], where S is the starting attachment index of the group, and <group-name> is any alphanumeric string that does not start with a number.
Optionally, a priority value can be specified for each group, indicating the priority with which the group should be recognized when encoding into Group SELFIES.Priority affects the Group SELFIES encoder as described in Section 3.4.For each group, one can also specify its index overload value, which is the value the group token takes when the decoder must interpret the token as a number.

Encoding and Decoding
Encoding: To encode a molecule in Group SELFIES, the encoder first recognizes and replaces substructure matches of groups from the molecule.By default, the encoder iterates through the group set and recognizes the largest groups first, but users can override this by specifying a priority for each group.Setting a high priority value for a group indicates that it will be recognized first when encoding into Group SELFIES, ensuring that other group replacements will not overlap with this group.This encoding strategy iterates implies that increasing the size of the group set will increase the running time of the encoder linearly.We then traverse the graph similar to the encoding process for SMILES and SELFIES, while also placing the correct tokens for tracking the attachment indices of where the encoder entered and exited a group.
Decoding: When decoding Group SELFIES, the process is essentially the same as regular SELFIES except when reading group tokens.When a group token is read by the decoder, the group set dictionary determines the corresponding group.Subsequently, all atoms of the group are placed and the main chain is connected to the starting attachment point.The decoder selects the next attachment point to branch off from by reading in the next token as a relative index.By adding the current attachment index to a relative index modulo the total number of attachment points in the group, the decoder selects the next attachment point.From the specified attachment point, the decoder implicitly branches off of the group and continues traversing until a [pop] token is read.Once the branch is "popped", the decoder returns to the group and can navigate to the next attachment point using another relative index.If the selected attachment point is occupied, then the next available attachment point is used.If all attachment points have been used up, then the group itself is immediately "popped", returning to the most recent branchpoint before the group was placed.
We verified the robustness of Group SELFIES by encoding and decoding 25M molecules from the eMolecules database [36].We provide a detailed example of encoding/decoding the molecule celecoxib in Appendix A.1.
Group SELFIES manages chirality differently than SMILES and SELFIES.Rather than use @notation to specify tetrahedral chirality, all chiral centers must be specified as groups.We provide an "essential set" of 23 groups which encode all relevant chiral centers in the eMolecules database.Equipped with this essential set, every molecule can be encoded-decoded while maintaining chirality.
It is also an option to not use the essential set, or only use a subset of it, depending on what chiral centers are relevant to the problem at hand.If a molecule has a chiral center not specified in the group set, then encode-decode will not preserve chirality.

Determining Fragments
Group SELFIES has a built-in flexibility for assigning the set of fragments that make up a group set.Hence, the construction of a useful group set often remains an open design choice.Users can specify groups using a SMILES-like syntax, which could be useful if one knows what groups are synthetically available or are expected to be useful for their particular design task.Fragments can also be obtained from several fragment libraries found in the literature [37][38][39].Generally, a useful set of groups will appear in many molecules in the dataset and replace many atoms, with similar fragments merged together to reduce redundancy.
In our experiments, we also tested various fragmentation algorithms that extract fragments from a dataset, including a naïve technique that cleaves side chains from rings and a method based on matched molecular pair analysis [40].Several other fragmentation algorithms from cheminformatics can be readily applied [22,[41][42][43][44][45] and the essential features of Group SELFIES outlined in Section 1 remain robust to different fragment discovery methods.

Experiments
The experiments in the subsequent sections outline some of the advantages of Group SELFIES compared to SMILES and regular SELFIES representations.Concretely, we show that: (1) Group SELFIES is shorter and more compressible than SMILES and SELFIES; (2) Group SELFIES preserves useful properties during generation; (3) Group SELFIES improves distribution learning.

Compactness
Group SELFIES strings are typically shorter than their SMILES and SELFIES equivalents when using a generic set of groups.In Figure 4, this generic set was generated by taking a random selection of 10,000 molecules from ZINC-250k [46] and fragmenting them into 30 useful groups using various algorithms (see Determining Fragments).We then combined these 30 groups with the 23 groups of the essential set. Figure 4 shows histograms of the lengths of SMILES/SELFIES/Group SELFIES strings of the entire ZINC-250k dataset.Length is the number of characters in SMILES strings, and the number of tokens in (Group) SELFIES strings.Group SELFIES strings are usually shorter than their SELFIES and SMILES counterparts because group tokens can represent multiple atoms in a molecule.Since Group SELFIES has a larger alphabet than SMILES or SELFIES, we estimate the complexity of each representation with the compressed filesize of ZINC-250k.We find that out of all representations, Group SELFIES can be compressed the most (see Appendix A.2).

Molecular Generation
Algorithm 1 Sampling random (Group) SELFIES strings To specifically compare the suitability of the SELFIES and Group SELFIES representations for molecular generation, we use a primitive generative model which samples random strings.First, we convert a subset of N = 100,000 molecules from ZINC-250k into (Group) SELFIES strings.Then, we tokenize all strings and combine them into a single bag of tokens.To generate a new string, we first pick a random (Group) SELFIES string from our chosen subset and take its length l.We then randomly sample l tokens from the bag, and concatenate into a generated string.We generate N random strings for each representation.For Group SELFIES, we use the same 53 groups used for the length histogram in Section 4.1.
We show histograms of the SAScore [47] and QED [48] of molecules generated from ZINC in Figure 5.The distributions of generated Group SELFIES more closely overlap with the original ZINC dataset than the generated SELFIES, showing that even with an extremely simplistic generative model, Group SELFIES can preserve important structural information.We perform a similar analysis for a dataset of nonfullerene acceptors (NFA) [49] in Appendix A.3 and find that Group SELFIES preserves many aromatic rings in contrast to SELFIES, which rarely ever preserves aromatic rings.

Distribution Learning
To further quantify the effectiveness of Group SELFIES in generative models, we use the MOSES benchmarking framework [50] to evaluate variational autoencoders (VAEs) trained with both Group SELFIES and SELFIES strings.Models were trained for 125 epochs.The group set for the Group SELFIES VAE was created by fragmenting the training set provided by MOSES and selecting the 300 most diverse groups.A set of 100,000 molecules was then generated from each model and evaluated on the metrics provided by MOSES.For most metrics, Group SELFIES performs approximately the same as SELFIES.Validity is the percentage of generated molecules that are accepted by RDKit's parser.Uniqueness is the percentage of generated molecules that are not identical to any other generated molecule.Similarity to nearest neighbor (SNN) is the average Tanimoto similarity between generated molecules and the nearest neighbor in the reference set.Fragment similarity (Frag) is a cosine similarity based on the distribution of BRICS fragments [45] of generated and reference molecules.Scaffold similarity (Scaf) is a cosine similarity based on the distribution of Bemis-Murcko scaffolds [51] of generated and reference molecules.Internal diversity (IntDiv) measures the chemical diversity of the generated molecules using Tanimoto similarity.Filters is the fraction of generated molecules that pass filters for unwanted fragments.Novelty is the fraction of generated molecules not in the training set.
The Group SELFIES model performs especially well on the Fréchet ChemNet Distance (FCD) metric [52] when compared to SELFIES.FCD measures the difference between the activations of the penultimate layer of ChemNet (a model trained to predict the bioactivity of molecules) in the validation set and in the generated set.Due to how ChemNet was trained, the activations are likely to encode a mixture of biological and chemical properties important to drug likelihood.This makes comparing these activations more informative than comparing standard properties like logP or molecular weight, where the correlation to bioactivity is weaker and less deliberate.To visualize FCD, some indices of the penultimate activations of ChemNet are graphed in Figure 6.Generated Group SELFIES match these distributions more closely than generated SELFIES.

Discussion
Our experiments show that Group SELFIES has noticeable advantages compared to SMILES and SELFIES representations, including greater readability provided by the group tokens.With regards to SMILES, the primary advantage is chemical robustness.The comparison with SELFIES is more nuanced, as discussed in the section below.

Group SELFIES vs SELFIES
Substructure Control: Group SELFIES provides more fine-grained control of substructures, which creates the following advantages: (1) An important scaffold can be preserved during optimization; (2) Chiral and charged groups can be preserved during optimization, ensuring that charged tokens do not proliferate and create radicals; (3) Synthetically accessible building blocks can be chosen as groups to improve synthesizability.
Substructure Control with SELFIES: Various techniques applied to SELFIES can mitigate the challenges of preserving structure.One such example is to simply combine substrings of SELFIES strings together.Indeed, further experiments in Appendix A. 4 show that simply replacing all group tokens by their SELFIES substrings shows similar performance to Group SELFIES.Within the SELFIES framework, however, an inserted substring will not necessarily have that exact substructure when decoded because the first token of the inserted substring may need to be interpreted as a number, which can have cascading effects for the rest of the substring.It is also likely that upon further insertions, that substructure will not be preserved.Additionally, it is also not clear how an insertion based approach can create groups with 3 or more branches, since creating a third branch requires insertion in the middle of the group substring.
Extended Chirality: Group SELFIES is theoretically capable of representing molecules with extended chirality which are traditionally not able to be represented with SMILES and SELFIES.These representations can only handle local chirality -that is, chirality with a single atom as the chiral center.This is in contrast to global chirality, where there may be an axis or plane of chirality.
Group SELFIES can handle global chirality by taking an entire complex or chiral substructure and abstracting it into a group, leaving attachment points on the outside for varying functionalization.
Figure 7 shows examples of groups with local and global chirality that Group SELFIES can handle.We leave the proper implementation of representing global chirality to future work.
[6]helicene Computational Speed: One tradeoff of Group SELFIES is that encoding and decoding is usually slower than with SELFIES, likely due to overhead of RDKit operations.The encoder is particularly slow as it relies on performing a substructure match for every group in the group set.The decoder is faster than the encoder, though still slower than the SELFIES decoder.See Appendix A.6 for timing.
To improve computational performance in future work, one could exploit substructure control of Group SELFIES to reduce the number of encode/decode calls needed to obtain high performers.
Additionally, the speed of encoding and decoding operations can be improved with distributed computing, since Group SELFIES is trivially parallelizable for a fixed group set.

Future Work
One promising extension of Group SELFIES is to incorporate more flexibility into the representation.In such a case, a group token can represent an entire scaffold, except without the atom identities.
Other tokens can then identify the atom types on the scaffolds.This would allow optimization of the atom types while maintaining the topological structure of the scaffold.Another current limitation of Group SELFIES is that groups cannot overlap; more work is needed to develop a representation that acknowledges how groups might overlap, particularly for generating polycyclic compounds.A sequence-based representation of cellular complexes [53] or hypergraphs might suggest a promising direction.
Finally, given this paper's focus on the molecular representation, we only applied rather simple generative modeling methods.We hope that future work can leverage Group SELFIES to perform molecular generation with more advanced generative methods, including chemical language models, deep generative models, and evolutionary methods.

A Appendix A.1 Example
As an example, we will go through encoding and decoding the molecule celecoxib.
Figure 8: Celecoxib We will define our group set as consisting of 4 main groups: trifluoromethane, toluene, benzene, and sulfonamide.
Figure 9: Group set used in this example.
We will now extract occurrences of the groups in the molecule.Since priorities were not specified, we match groups by size in descending order.
We will now traverse the graph.The following diagrams demonstrate this process.
We then place an index token of [+2] to go from the 0th attachment point to the 2nd attachment point.For readability, index tokens are as numbers, but in the actual encoder output a token with the appropriate overload value would be used (e.g. in this case [Ring2] is overloaded to [+2]).
We again find that our current atom is in a group.This time, we place the group token with the starting attachment point at the 2nd attachment index, since that is where the last bond entered from.
We then use the [pop] token to exit the current group.In this case, we go from the trifluoromethane group back to the pyrazole group.
The next few steps out the carbon chain of the benzene using atomic tokens, including placing a branch and a ring bond.
The decoding process for this Group SELFIES string will read in token-by-token, building the molecular graph in the same order as shown in this encoding example.

A.2 Complexity and compression
To take into account the alphabet size of SMILES, SELFIES, and Group SELFIES, we compress each representation using zlib to estimate the complexity of each representation.We first do index encoding to convert each string into a list of ints by replacing each unique token with a unique int.We then compress this representation using zlib.3: Filesize of ZINC-250k when represented in SMILES, SELFIES, and Group SELFIES.When using strings, SMILES takes up the least space.When using index encoding and then compressing using zlib, Group SELFIES takes up the least space.We repeat Experiment 4.2 but for a dataset of nonfullerene acceptors (NFA) [49].This dataset contains many conjugated aromatic systems.In this case, we histogram by SAScore and the number of aromatic atoms, as shown in Figure 10.Generated SELFIES are rarely able to preserve the aromatic systems found in NFA, but generated Group SELFIES preserve much more of the aromatic systems, and can produce molecules with lower and higher SAScore.When binning by number of aromatic atoms, molecules with 0 aromatic atoms are omitted for clarity.About 49% / 98% of molecules generated via Group SELFIES / SELFIES have 0 aromatic atoms.Group SELFIES uses a group set containing 74 groups obtained via naïve fragmentation.N = 51281 molecules are generated via Group SELFIES / SELFIES, equal to the size of the NFA dataset.

A.4 Substring SELFIES
Substring SELFIES is generated by taking a Group SELFIES string and replacing every group token with a SELFIES substring corresponding to it.We compare Substring SELFIES and Group SELFIES in the context of Experiment 4.2 in Figure 11 and find that generated Substring SELFIES are on par with generated Group SELFIES.
Figure 11: Generated Group SELFIES and Substring SELFIES binned by SAScore and QED.Molecules were generated from ZINC-250k.

A.5 No-Group SELFIES
No-Group SELFIES is generated by using the Group SELFIES encoder with an empty group set.We compare regular SELFIES to No-Group SELFIES to determine the effect of small changes in the base encoding process, such as branch tokens.In SELFIES, all [BranchX] expect a number token after to determine the length of the branch, whereas in Group SELFIES, all [Branch] tokens do not need a number token, instead using a [pop] token to exit branches.In Figure 12, generated No-Group SELFIES are on par with or worse than generated SELFIES, indicating that the modified encoding process does not contribute much to the overall performance of Group SELFIES.

Figure 2 :
Figure 2: Visual explanation of Group SELFIES encoding/decoding of celecoxib.Top-left: molecular structure of celecoxib.Top-middle: the structure of celecoxib colored by its groups and atoms, with arrows and attachment indices indicating how the encoder and decoder navigate around the groups.Top-right: index overload table in Group SELFIES, indicating how tokens are interpreted as numbers.Bottom: Celecoxib represented in Group SELFIES.Tokens are colored by the groups and atoms they act on.Index overloads are shown where interpreted.Arrows indicate how branches return to their original branchpoints and how rings form bonds with previously placed atoms (light-blue).

Figure 3 :
Figure 3: Representation of a possible "benzene" group and its use in a corresponding group token.The *N notation represents an attachment point with valency N.Each attachment point is labeled with its attachment index 0th,1st...The starting attachment point represented in the token is also highlighted.

Figure 4 :
Figure 4: Histogram of lengths of SMILES, SELFIES, and Group SELFIES strings of the ZINC-250k dataset.Here, Group SELFIES uses a group set of 53 groups.

Figure 5 :
Figure 5: Molecules generated by our primitive generative model are binned by SAScore and QED.For both properties, generated Group SELFIES have greater overlap with the original ZINC distribution.Bracketed values indicate the Wasserstein distance (a measure of overlap) to the ZINC distribution.Dashed lines indicate the means.

Figure 6 :
Figure 6: Distribution of some values from the second-to-last layer of ChemNet for molecules generated by Group SELFIES and SELFIES compared to the validation set.The difference in distributions is used to calculate FCD.Bracketed values in the legend represent the Wasserstein distance to the original MOSES distribution.

Figure 7 :
Figure 7: Examples of chiral groups that can be represented in group SELFIES.Tetrahedral carbon, hexoses, and octahedral metal centers have local chirality, whereas allenes, metallocenes, helicenes, and substituted biphenyls have global chirality.

A. 3 Figure 10 :
Figure 10: Generated Group SELFIES and regular SELFIES binned by SAScore and number of aromatic atoms.Molecules were generated from the NFA dataset.

Figure 12 :
Figure 12: Generated regular SELFIES and No-Group SELFIES binned by SAScore and QED.Molecules were generated from ZINC-250k.

Table 1 :
. Comparison of the capabilities of SMILES, SELFIES, and Group SELFIES.Group SELFIES provides group representation, representation of extended chirality and chemical robustness.

Table 2 :
Group SELFIES VAE and SELFIES VAE evaluated on MOSES metrics.The Group SELFIES VAE mostly matches or outperforms the SELFIES VAE.

Table 4 :
Running time of encoder/decoder for SELFIES and Group SELFIES.The SELFIES encoder is 65x faster than the Group SELFIES encoder, and the SELFIES decoder is 19x faster than the Group SELFIES decoder.Timing is averaged per molecule over the ZINC-250k dataset, using the same group set of 53 groups as the length histogram experiments.Computations were done on a 2021 Apple MacBook Pro with a M1 Pro chip and 32 GB RAM.