Extending BigSMILES to non-covalent bonds in supramolecular polymer assemblies

As a machine-recognizable representation of polymer connectivity, BigSMILES line notation extends SMILES from deterministic to stochastic structures. The same framework that allows BigSMILES to accommodate stochastic covalent connectivity can be extended to non-covalent bonds, enhancing its value for polymers, supramolecular materials, and colloidal chemistry. Non-covalent bonds are captured by adding annotations to pseudo atoms that serve as complementary binding pairs, minimal key/value pairs that elaborate other relevant attributes, and indexes that specify the pairing among potential donors and acceptors or bond delocalization. Incorporating these annotations into BigSMILES line notation enables the representation of four common classes of non-covalent bonds in polymer science: electrostatic interactions, hydrogen bonding, metal–ligand complexation, and π–π stacking. The principal advantage of non-covalent BigSMILES is its ability to accommodate a broad variety of non-covalent chemistry with a simple, user-oriented, semi-flexible annotation formalism. This goal is achieved by encoding a universal but non-exhaustive representation of non-covalent or stochastic bonding patterns through syntax for (de)protonated and delocalized states of bonding, as well as nested bonds for correlated bonding and multi-component mixtures. By allowing user-defined descriptors in the annotation expression, further applications in data-driven research can be envisioned to represent chemical structures in many other fields, including polymer nanocomposites and surface chemistry.


Polyelectrolytes
More examples of annotated polyelectrolytes are shown here. In Fig. S1a and b, polycation and polyanion copolymers are represented with charged comonomers binding to counterions from added salts. For zwitterionic polyelectrolytes, the relative abundance of positively and negatively charged counterions can vary depending on pH (Fig. S1c and d). As discussed in the main text, this uncertainty can be avoided by deferring the specification of the bonding state to a data model, which greatly simplifies the annotation of a wide variety of zwitterionic polyelectrolytes, as illustrated in Fig. S2. Note that only the deprotonated state of sulfonate groups is represented in the following examples, because a strong polyelectrolyte is usually completely ionized and only under extreme conditions would both protonated and deprotonated states be present.

Figure S1. Annotation of electrostatic bonds in polyelectrolyte copolymers with examples: (a) strong polycation DMA-HEMA (dimethylacetamide-(hydroxyethyl) methacrylate) random copolymer esterified with excess BIBB (2-bromoisobutyryl bromide) and quaternized by excess methyl iodide; [51] (b) strong polyanion poly-HEMA esterified by BIBB and SBA (2-sulfobenzoic acid cyclic anhydride) in sequence; [51] and zwitterionic polyelectrolytes of a general bonding state with positively and negatively charged groups in the same and in different monomers: (c) betaine functionalized vinylbenzyl chloride; (d) sulfoethyl methacrylate and (methacryloyloxy) ethyl trimethylammonium chloride. [52][53] Note that in the above figure, different parts of chemical structures are coded with the same colors as their non-covalent BigSMILES strings.

Figure S2. Examples of zwitterionic polyelectrolytes represented by a general state of electrostatic bonding: (a) polymeric sulfothetin, monomer of 3-(methyl{2-[(2-methylacryloyl)oxyl]ethyl} sulfaniumyl) propane-1-sulfonate; [3] (b) poly(2-methacryloyloxyethyl phosphorylcholine); [4] (c) polynorbornene based sulfobetaines; [5] (d) poly(carboxybetaine methacrylate); [6] (e) poly(carboxybetaine methacrylate); [7] (f) thiophene based conjugated polyzwitterions; [8] (g) polycarboxybetaine acrylamides; [2] (h) poly(cysteine methacrylate); [9] (i) fluorinated norbornene based polybetaine. [10] Note that in the above figure, different parts of chemical structures are coded with the same colors as their non-covalent BigSMILES strings. The single integers without brackets "i" are recursive nodes for cyclic structure.

Hydrogen bonding
To complement those included in the main text, additional examples illustrating the annotation syntax for H-bonding are given in Figs. S3 to S5. In Fig. S3, the capability of our annotation syntax is demonstrated by distinguishing different connectivity patterns for polymers self-assembled via H-bonding. The fully independent binding of two donor and acceptor sites (coded by orange, green, purple and blue in Fig. S3a) in each monomer, contained in the same or in different polymers, requires no indexing on any hydrogen bond descriptor, as any pairwise connection can potentially form. If a rigorous one-to-one binding must be specified across all monomers between closely ...

Given the philosophy that an annotated non-covalent bond is only viewed as a potential connection, the unindexed hydrogen bonding in panel a of Fig. S3 accounts for both scenarios depicted in panels b and c. This concept is further illustrated in Fig. S4, where the nested indexing formalism describes the correlated bonding of multiple donor and acceptor pairs. Physical gels are formed by hydrogen bonds with small molecule crosslinkers (Fig. S4a) as well as with functional groups of (co)monomers (i.e., from the same polymer backbone, Fig. S4b and c). In Fig. S5, the same nested annotation as that of Fig. S4 ...

Figure S4. Annotation of physical gels formed by hydrogen bonds under a small molecule crosslinker, (a) bis-maleimide crosslinked methacryl-succinimidyl functionalized poly(N-isopropylacrylamide), [13] and the same indexing of copolymer backbone, (b) poly-(butyl acrylate)-co-(acrylic acid) and (c) poly-(butyl acrylate)-co-(acrylamidopyridine), [16] to reflect the connectivity patterns. Note that in the above figure, different parts of chemical structures are coded with the same colors as their non-covalent BigSMILES strings. The single integers without brackets "i" are recursive nodes for cyclic structure, while the integers enclosed with two layers of square bracket "[[i]j]" are the indices of annotated groups of bonds.

Figure S5. The use of non-covalent BigSMILES syntax to represent architectures assembled by H-bonding between pairs of Upy-like motifs: (a) helical (bifunctional ureido-s-triazines with penta-ethylene-oxide side chains) [17] and (b) looped (poly(tetramethylene glycol) of dimerized Upy ends). [11] Note that in the above figure, different parts of chemical structures are coded with the same colors as their non-covalent BigSMILES strings. The single integers without brackets "i" are recursive nodes for cyclic structure, while the integers enclosed with two layers of square bracket "[[i]j]" are the indices of annotated groups of bonds.

Polyelectrolyte and hydrogen-bonded networks
For complicated multi-component systems, such as coacervate complexes and hydrogels, the overall system should be represented as a combination of non-covalent BigSMILES strings with bonding interactions between molecules. This formalism can be handled using an additional pair of {…}, as illustrated in Fig. S6. Unlike in Fig. S1c and d, the association of polycation and polyanion in Fig. S6a (which exemplifies the construction of a layer-by-layer assembly) occurs between different polymers, with donor and acceptor bond descriptors of distinct stochasticity. To properly represent such "multi-component" systems, the strings of all interacting macromolecular components shall be enclosed in […, …] to annotate the non-covalent binding of multiple polymers as a supramolecular assembly. The resulting specification of the inter-component electrostatic bonds is borne out of the local BigSMILES strings that represent each component. This is different from covalent bonds, which are always local to a single BigSMILES string. Furthermore, as depicted in panels b and c, gels formed with hydrogen bond-based functional groups are annotated, which again requires the mixture formalism because of the correlated bonding of multiple components.

Coordination polymers, polyMOFs and polyMOCs
With metal ions and small molecule ligands as RUs (repeating units), coordination polymers, polyMOFs and polyMOCs can be properly annotated under the above syntax, as shown in Fig. S7 and Fig. S8. The use of additional key/value pairs to annotate the coordination geometry and the number of electrons involved in metal-ligand complexation is also illustrated in Fig. S8. Figure S9 shows representations of supramolecular networks for polymer ligands with metal ions acting as multivalent crosslinkers.

Note that in the above figures, different parts of chemical structures are coded with the same colors as their non-covalent BigSMILES strings. The single integers without brackets "i" are recursive nodes for cyclic structure, while the integers enclosed with two layers of square bracket "[[i]j]" are the indices of annotated groups of bonds.

Ferrocene-like delocalization
As a specific class of metal-ligand complexation, ferrocene-like delocalization involves interactions between a metal ion and a group of atoms with conjugated π-orbitals. By including the atom indices for delocalization (see Fig. 3) in the metal-ligand bond descriptors (i.e., View 2 of Fig. 5a in the main text), polymers with such interactions can be well represented, as shown in Figs. S10 and S11.

π-π stacking
More annotation examples for π-π stacking are included in the following figures. Figure S12 shows examples where small molecules are stacked to form polymer-like assemblies, with the interacting atoms represented by their complementary indices via the expression |x~y. In fact, there are three distinct formalisms for atom indexing: a) x~y, where the integers x and y are the two inclusive boundary values of the range of indices, with a step size of 1; b) x&y&z, where the integers x, y, and z are combined by "&" and are the indices of the individual atom characters to include in the annotation; or c) x~y!m!n, where atoms with indices in the range of integers from x to y are included except those of index m and n (the exclamation symbol "!" is equivalent to the logical operator "not"). It is worthwhile to note that the syntax here is inadequate to represent the orientation of stacking. The specific sequence of the atom indices solely reflects the user's choice in setting the ring index (the first atom to start a ring).

Since stackable planar π systems can interact with each other from either above or below the plane, they are effectively bifunctional and can thus assemble into network structures, as illustrated by Fig. S13: the stacking is annotated between monomers of a polymer (panels a, b and c) as well as between the polymer side or end groups (panels d and e). Note that in those examples, stacked planar π systems are annotated with an identical indexing sequence, except for Fig. S13a and c, where side groups of the aromatic rings, which do not contribute to the stacking, interrupt the sequence of indexing. Therefore, instead of the indices being 1~6, the presence of the ester and sulfo groups interrupts the indexing of the stacked carbons in the benzene ring, giving rise to 1~3&8~10 in Fig. S13a and 1&6~10 in Fig. S13c, respectively.
The annotation expression for stacking is anchored next to the last involved atom character (the 1-index atom), so that the absolute values of the indices of the stacked atoms are not affected by the location of the annotation expression.
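The three indexing formalisms above can be illustrated with a short parser. The following is only a minimal sketch, not part of the published tooling: the function name `parse_atom_indices` is hypothetical, and it assumes that "&" may join ranges as well as single indices (as in 1~3&8~10), that each "!" exclusion applies to the term it follows, and that a leading "|" simply introduces the expression.

```python
def parse_atom_indices(expr: str) -> list[int]:
    """Expand an atom-index expression such as '1~3&8~10' into sorted indices.

    Hypothetical helper for illustration only.
    """
    selected: set[int] = set()
    # Assumption: a leading '|' only introduces the expression and carries no index.
    for term in expr.strip().lstrip("|").split("&"):
        # Each term is 'x', 'x~y', or 'x~y!m!n'; every '!' marks an excluded index.
        head, *excluded = term.split("!")
        if "~" in head:
            start, stop = (int(value) for value in head.split("~"))
            indices = set(range(start, stop + 1))  # both boundaries are inclusive
        else:
            indices = {int(head)}
        selected |= indices - {int(value) for value in excluded}
    return sorted(selected)


if __name__ == "__main__":
    for expression in ("1~6", "1~3&8~10", "1&6~10", "1~6!2!3"):
        print(expression, "->", parse_atom_indices(expression))
```

Under these assumptions, the interrupted benzene-ring indexing of Fig. S13a (1~3&8~10) expands to six atom indices, the same number as the uninterrupted 1~6.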

Chain folding & host-guest interactions
Annotations of complicated examples of folded chain architectures (formed via π-π stacking) are shown in Fig. S14. In Figs. S15 and S16, the potential of the developed annotation formalism is demonstrated once again by representing different combinations of non-covalent chemistries that build host-guest interactions with complex supramolecular architectures.

Figure S14. Illustration of π-π stacking patterns and atom indexing in the annotation of folded chain architecture. (a) Chain folding of polydiimide interpolated (triethylenedioxy) naphthalene-diimide with perylene-terminated poly(ethylene glycol); [49] (b) chain-folding triethylenedioxy-diimide motif with terminal pyrenyl residues in polyimides. [44][45] Note that in the above figure, different parts of chemical structures are coded with the same colors as their non-covalent BigSMILES strings. The single integers enclosed with round brackets "(i)" are the indices of annotated atoms, while the single integers without brackets "i" are recursive nodes for cyclic structure.

Figure S15. The use of non-covalent BigSMILES syntax to represent supramolecular assemblies with (a) host-guest architecture of a BPP34C10 (bis(p-phenylene) crown-10)-paraquat-based analogue [46] and (b) multiple types of non-covalent bonds in Cu II centered salicylaldehyde rhodamine hydrazone derivatives. [47] Note that in the above figure, different parts of chemical structures are coded with the same colors as their non-covalent BigSMILES strings. The single integers enclosed with round brackets "(i)" are the indices of annotated atoms, while the single integers without brackets "i" are recursive nodes for cyclic structure. The integers enclosed with two layers of square bracket "[[i]j]" are the indices of annotated groups of bonds.

Biopolymers
The inclusion of non-covalent interactions in the BigSMILES formalism allows for the meaningful representation of biopolymers. This improvement is achieved by explicitly labelling, for example, the electrostatic interactions and hydrogen bonding between base pairs in nucleic acids or between individual amino acids in a protein sequence. As illustrated by the binding of modified chitosan and heparin shown in Fig. S17a, multiple electrostatic bonding sites and the chirality of carbon atoms in the backbone are properly annotated. In a second example, an artificial spider silk, illustrated in Fig. S17b, contains a stochastic poly(alanine) and poly(ethylene oxide); the chirality and hydrogen bonding in the poly(alanine) give rise to the observed beta-sheet formation in this bioinspired system. [54]

Our group developed a code that accepts a sequence of amino acid one-letter symbols constituting a repeat unit and outputs a compact string representation of the protein, encoded in BigSMILES with covalent and non-covalent interactions, that can be input into polymer databases like PolyDAT. The code is available on GitHub at https://github.com/olsenlabmit/BigSMILES_parser. The repeat unit for each amino acid was first expressed using this code. Non-covalent interactions were then added under the assumption that any hydrogen covalently bound to a nitrogen or oxygen can hydrogen bond with any nitrogen or oxygen that has a lone pair. Moreover, electrostatic interactions are encoded for charged amino acids like arginine and histidine. After the user enters a sequence of one-letter amino acid symbols, the code iterates through each one-letter symbol, concatenates the repeat unit of each amino acid, generates a single repeat unit for the user's input sequence, and outputs the non-covalent BigSMILES. Table S1 shows an example of string outputs for three elastin-like polypeptides with different guest residues.
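The iterate-and-concatenate step described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code in the repository: the lookup table `RESIDUE_SMILES`, the function `sequence_to_bigsmiles`, and the choice of wrapping the repeat unit in a single stochastic object with `[<]` and `[>]` bond descriptors are all hypothetical. Only four residues are included, and the hydrogen-bond and electrostatic annotations are deliberately omitted, so the output is plain covalent BigSMILES.

```python
# Hypothetical lookup table: backbone fragment of each L-amino acid residue,
# written from the N-terminus to the C-terminus (illustration only).
RESIDUE_SMILES = {
    "G": "NCC(=O)",
    "A": "N[C@@H](C)C(=O)",
    "V": "N[C@@H](C(C)C)C(=O)",
    "P": "N1CCC[C@H]1C(=O)",
}


def sequence_to_bigsmiles(sequence: str) -> str:
    """Concatenate residue fragments into one repeat unit and wrap it as a
    BigSMILES stochastic object (covalent connectivity only)."""
    fragments = []
    for symbol in sequence.upper():
        if symbol not in RESIDUE_SMILES:
            raise ValueError(f"no fragment defined for residue {symbol!r}")
        fragments.append(RESIDUE_SMILES[symbol])
    repeat_unit = "".join(fragments)
    # '[<]' and '[>]' are the complementary head and tail bond descriptors.
    return "{[][<]" + repeat_unit + "[>][]}"


if __name__ == "__main__":
    # Elastin-like pentapeptide repeat with valine as the guest residue.
    print(sequence_to_bigsmiles("VPGVG"))
```

A fuller version would extend the table to all twenty residues and append the non-covalent bond descriptors for each donor, acceptor, and charged group following the syntax defined in the main text.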