Chemical and linguistic considerations for encoding Chinese characters: an embodiment using chain-end degradable sequence-defined oligourethanes created by consecutive solid phase click chemistry

Sequence-defined polymers (SDPs) are currently being investigated for use as information storage media. As the number of distinct monomers in the SDPs increases, with a corresponding increase in mathematical base, the use of tandem-MS for de novo sequencing becomes more challenging. In contrast, chain-end degradation routines are truly de novo, potentially allowing very large mathematical bases for encoding. While alphabetic scripts have a few dozen symbols, logographic scripts, such as Chinese, can have several thousand symbols. Using a new in situ consecutive click reaction approach on an oligourethane backbone for writing, and a previously reported chain-end degradation routine for reading, we encoded and decoded a Confucius proverb written in Chinese characters using two encoding schemes: Unicode and Zhèng Mǎ. Unicode is an internationally standardized scheme that assigns each symbol an arbitrary string of hexadecimal (base-16) digits; it efficiently encodes uniquely identifiable symbols but requires either complete fidelity of transmission or context-based inferential strategies to be interpreted. The Zhèng Mǎ approach encodes with a base-26 system using the visual characteristics and internal composition of the Chinese characters themselves, which leads to greater ambiguity of encoded strings, but more robust retrievability of information from partial or corrupted encodings. Applying information-encoded oligourethanes to two different encoding systems allowed us to establish their flexibility and versatility for data storage. We found the oligourethanes immensely adaptable to both encoding schemes for Chinese characters, and we highlight the expected tradeoff between the efficiency and uniqueness of Unicode encoding on the one hand, and the fidelity to a script's particular visual characteristics on the other.


Introduction
tSequenced-dened polymers (SDPs), such as polyureas, nucleic acids and peptides, have been used in applications as catalysts, foldamers, self-assembled materials, and biomaterials. 1,25][6] Lutz, Du Prez, and others [7][8][9][10] have introduced several designs of abiotic SDPs to store information precisely and efficiently.In most cases the decoding process requires tandem Mass Spectra (MS) analysis, analogous to its use in proteomics. 11,12However, de novo sequencing using MS/MS becomes increasingly challenging as the number of monomers increases.In proteomics, one typically knows the sequences being sought, and they are identied by comparison to a database. 13,14In contrast, chain-end degradation sequencing routines, such as Edman degradation, 15 are entirely de novo.7][18] The method sequentially and incrementally eliminates monomers via a 5-exotrig cyclization from the O-terminus (Fig. 1). 19Using this method, we reported the encoding of a text passage from Jane Austen's Manseld Park with a hexadecimal symbolic-code (base-16), which could be read independently without prior knowledge of the information stored. 20Further, due to the simple deconvolution process, we showed that eight 10-mer OUs can be decoded simultaneously by the use of mass-tags that sort the mixtures of OUs. 21However, the labor involved in the synthesis of each monomer, one at a time, prior to incorporation into the polymer via solid-phase synthesis, limits the number of monomers that can be incorporated.][24] The prior use of SDPs for information encoding has focused on text passages in languages written in alphabetic systems. 
20,25xpanding the palette of encoding monomers available allows an exploration of novel strategies for encoding different writing systems.Binary encodings are natural in the context of computers' recognition of a simple on/off distinction.Alphabetic scripts typically consist of a small number of characters and little meaningful information based on visual similarities between those characters.However, many East Asian writing systems are logographic, where the symbols can represent whole words.The characters in such systems oen number in the tens of thousands.Morpho-syllabic Chinese characters each represent a syllable with distinct meanings, but also contain visual elements that meaningfully relate those characters to other visually similar characters.Further, the symbols historically cue aspects of the character's meaning or pronunciation, and in some cases visually disambiguate words that have the same pronunciation (homophones).Different methods of encoding and decoding Chinese characters make different decisions about what meaningful aspects of these visual relations between characters are encoded or ignored. 27ere, we apply our chemical methods to two existing encoding schemes that are attuned to different characteristics of logographic writing systems [26][27][28][29][30] to establish the SDPs' adaptability, by encoding and decoding a confucius proverb (Fig. 2).In advance of creating SDPs for encoding Chinese characters, we reviewed several linguistic approaches and selected two representative encoding schemes: Unicode and the Zhèng Mȃ (ZM) method, which privilege informational efficiency and visual delity, respectively.

Encoding and linguistic considerations
Currently, Unicode provides the most commonly employed character encoding scheme. 31 This system encodes the symbols from common alphabets and syllabaries employed in the Americas, Europe, Africa, and parts of Asia, as well as the logographic symbols used traditionally, and, in some cases, currently, in several East Asian countries like China, Japan, and North and South Korea, as well as ancient world scripts such as Egyptian hieroglyphs. Unicode posits a code space divided into myriad cells; each cell receives a unique index, a hexadecimal number known as a code point. A given cell may contain a symbol or remain (as yet) unoccupied: when occupied, the symbol in the cell is uniquely identified by the cell's code point. This schema treats alphabetic and logographic symbols equivalently, as Unicode characters. For example, the lowercase letter z of the Roman alphabet receives the identifier U+007A (where the prefix 'U+' is followed by the hexadecimal code point for a Unicode character), while the Mandarin symbol 仁 (rén, benevolence) corresponds to U+4EC1. 26 Distinct identifiers represent distinct symbols in all instantiations of Unicode. 32

Unicode intends to identify individual characters uniquely and efficiently across all major world scripts. Though issues surrounding an original symbol's subsequent variation in distinct milieus persist in Unicode, its adoption presents a notable expansion beyond the more limited ASCII-based encoding used in our previous work. 19 However, Unicode does not encode visual information about characters or their internal composition, but instead arbitrarily assigns codes within a particular range. Thus, similar codes rarely imply similar characters, and vice versa. If even one element of the code point is lost or corrupted, the incorrectly identified character will be unrelated to the intended character. By contrast, a system based on the visual composition of characters encodes meaningful information with each element of the code string, so that mistakes in the encoding or decoding process may still yield characters similar to the intended target. While earlier work explored efficiency, e.g., through Huffman encoding, 19 the present work seeks instead to explore the range of encoding styles supported by SDPs in an effort to spur novel approaches to preserving unique characteristics among world writing systems.
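The mapping between a character and its hexadecimal code point can be illustrated with a minimal Python sketch. The function names are hypothetical and are not taken from the authors' software; the sketch assumes code points that fit in four hexadecimal digits, as is the case for the examples given in the text.

```python
def char_to_hex_digits(ch: str) -> list[str]:
    """Split a character's Unicode code point into hexadecimal digits."""
    # Zero-pad to four digits so that U+007A becomes ['0', '0', '7', 'a'].
    return list(format(ord(ch), "04x"))


def hex_digits_to_char(digits: list[str]) -> str:
    """Rebuild the character from its hexadecimal digits."""
    return chr(int("".join(digits), 16))


digits = char_to_hex_digits("仁")       # the code point U+4EC1 from the text
print(digits, hex_digits_to_char(digits))

# Corrupting a single digit selects a different code point, which need not
# resemble the intended character in any way.
corrupted = digits[:-1] + ["2"]
print(corrupted, hex_digits_to_char(corrupted))
```

Each hexadecimal digit would correspond to one encoding monomer, which is why a single misread digit changes the whole character.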
Fig. 2 The workflow of this project.

To explore SDPs' range of applicability using the same monomers, we sought a character encoding scheme capturing visual characteristics and internal composition of Chinese characters. Historically several such schemes have been used. Among the earliest, the Four-Corner (FC) method, 33 devised in the 1920s, distinguishes 10 basic stroke shapes. It encodes each character by 4 digits recording the stroke shapes in a character's four corners. However, the resulting codes are far from unique: the FC method's conflict code rate (CCR, roughly the percentage of Chinese characters whose code corresponds to more than one character) approaches 85%. 34 Thus, we felt the FC method was not optimal for use with SDPs, where the symbol's context is unknown.
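The conflict code rate defined above can be computed from any character-to-code table. The following is a minimal sketch under that rough definition; the table contents are placeholders chosen for illustration, not real Four-Corner codes.

```python
from collections import Counter


def conflict_code_rate(char_to_code: dict[str, str]) -> float:
    """Fraction of characters whose code is shared with another character."""
    usage = Counter(char_to_code.values())
    conflicted = sum(1 for code in char_to_code.values() if usage[code] > 1)
    return conflicted / len(char_to_code)


# Hypothetical table: two of the four placeholder characters share a code.
toy_table = {"A": "1234", "B": "1234", "C": "5678", "D": "9012"}
print(f"CCR = {conflict_code_rate(toy_table):.0%}")
```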
Several approaches reduce such ambiguities. 29,35 The ZM method, from the early 1990s, reduces ambiguities 36 to a CCR of just over 9%. 34 In contrast to the FC method, ZM foregrounds characters' internal structure and maps Chinese characters to the standard QWERTY keyboard. ZM decomposes a character into distinct compositional elements (similar but not identical to traditional 'radicals'), known as roots: groups of strokes that always appear as a unit, whether as a standalone character, or as a component repeated within numerous other characters. ZM divides roots into two classes, primary and secondary (Fig. 3). It maps primary roots to 1-letter strings of the QWERTY keyboard (the 26 letters of the English alphabet), and secondary roots to 2-letter strings. The method then decomposes a character into a sequence of primary and secondary roots in left-to-right, top-to-bottom order, and encodes the character by the sequence of strings corresponding to the roots. But ZM also imposes a set of rules to stipulate that no complete character code exceeds 4 letters on the keyboard. 36 With a list of the predefined correspondences between QWERTY letters and primary or secondary roots, a user can generate the ZM code, of at most 4 letters, for any Chinese character. Thus, with 4 elements over a 26-symbol base, this allows 26^4 = 456 976 potential codes, roughly 10 times the current number of Chinese characters. This makes the ZM method a visually attuned system for encoding Chinese characters, each of which can be stored as an individual 4-mer oligourethane using a base-26 encoding capacity.
Because, unlike Unicode, ZM does not achieve total uniqueness (∼9% CCR), 34 a single Chinese character might not correspond to a single code, and vice versa. Some individual characters correspond to a variety of codes simply due to ambiguity in the order for listing the roots comprising the character: e.g., 近 (jìn, be near) corresponds to ZM codes PDW and WPD. Considering this potential for ambiguity, one benefit of an encoding scheme motivated by the visual layout of a character is that incorrectly identified characters will likely be visually similar to the intended character. Thus, while errors are more likely using ZM than Unicode, ZM errors will plausibly involve visually similar characters, rather than an entirely unrelated (and possibly not even Chinese) character, as might be the case with a Unicode error. We selected a quote (see below) to illustrate these redundancies, and we explore different heuristics needed to incorporate ZM into a viable SDP data storage workflow using OUs as the example.

Oligourethane considerations
Given the different strengths and challenges of the Unicode and ZM methods for encoding Chinese characters, we set out to design an oligourethane (OU) encoding (writing) and chain-end sequencing (reading) technique for Chinese characters adaptable to both methods, where individual OUs would code for a specific logographic character. First, in both Unicode and ZM schemes, each OU would require only four to five monomers to represent a single character, and hence the OUs could be quite short. Second, while we have already demonstrated the ability to write in hexadecimal, 37 as needed for Unicode, the ZM method uses 26 symbols and therefore would require us to synthesize 26 unique monomers. Thus, we turned to exploring in situ on-resin synthetic methods (see below) to avoid having to create unique monomers. ][24] To demonstrate our synthetic approach and its ability to encode the complexity of a logographic writing system, we chose an eight-character proverb from the Analects of Confucius: 性相近也，習相遠也, 38,39 roughly "By nature [people] are near each other; by habitual action they become farther apart". 40 To further probe the versatility and adaptability of the approach, we encoded the proverb in both traditional and simplified Chinese characters (Table 1). The former appear in manuscripts through the centuries, but also find current use in Hong Kong, Taiwan, and other diaspora communities; the latter stem in part from earlier informal writing practices but were formalized over the 20th century into a system streamlined for modern writing needs in the People's Republic of China. While either system could in theory encode either script, we utilized the ZM method to encode the more recent simplified characters and Unicode for the more numerous traditional characters.

Chemical results and advances
Our previous work using oligourethanes for encoding introduced each monomer serially using solid-phase synthesis. 20 This synthetic approach requires individual monomers, each of which is synthesized independently in an O-terminus activated and N-terminus protected form. As alluded to above, we envisioned creating different monomers concurrently with the oligomer synthesis, adding side chains of varying mass to a common monomer. In order to fulfill this vision, we screened several reactions (Diels-Alder, 41 Suzuki couplings, 42,43 thia-Michael additions, 44 and copper-catalyzed azide-alkyne click (CuAAC) chemistry 45) for their efficiency of reaction on resin; each of these is well known to give high yields in solution. We found that only the copper-catalyzed click was compatible with the resin we were using for the solid-phase synthesis of the OUs, giving ∼95% yields, while the other reactions furnished low yields, or the resins were damaged by the reaction conditions. Therefore, we moved forward with CuAAC to achieve our goal.
Our synthesis commenced with the reduction of Fmoc-L-azidolysine to furnish compound 1, followed by activation of 1 with 4-nitrophenyl chloroformate to furnish the monomer 2 in good yield (Fig. 4). We utilized L-phenylalaninol-loaded 2-chlorotrityl polystyrene resin and L-alaninol-loaded 2-chlorotrityl polystyrene resin as the solid supports to start the synthesis, and hence one of these two monomers is consistently on the O-terminus of the resulting oligourethanes.
Starting with our published conditions for oligourethane synthesis, 19 monomer 2 was first appended to the resins. However, instead of deprotecting the Fmoc to then add another monomer, we exposed the resin (1 equivalent) to 0.25 equivalents of CuI, 0.5 equivalents of sodium ascorbate and 0.5 equivalents of tri(benzyltriazolylmethyl)amine (TBTA) for the CuAAC click, along with 5 equivalents of the specific alkyne desired for encoding (see below). Then, following Fmoc deprotection, a second monomer 2 was coupled, and so on (Fig. 5), until the synthesis of the entire oligomer was complete. In this manner, we could achieve an oligourethane capable of carrying as many different R groups as necessary for the mathematical base we are writing in (base-16 for Unicode, base-26 for ZM). Considering the number of mass-differentiated terminal alkynes that exist, the mathematical base can be substantially increased, which is a significant advance for the field of digital polymers and information encoding, because larger bases allow for denser information storage.
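The link between mathematical base and storage density can be made concrete: a monomer drawn from a library of b alternatives carries log2(b) bits, and a code space of N symbols needs the smallest chain length n with b^n ≥ N. The short sketch below works through this arithmetic; the function names are illustrative only.

```python
import math


def bits_per_monomer(base: int) -> float:
    """Information carried by one monomer chosen from `base` alternatives."""
    return math.log2(base)


def monomers_needed(n_symbols: int, base: int) -> int:
    """Smallest chain length whose code space covers `n_symbols` symbols."""
    length, capacity = 0, 1
    while capacity < n_symbols:
        capacity *= base
        length += 1
    return length


for base in (2, 16, 26, 128):
    print(base, round(bits_per_monomer(base), 2), monomers_needed(26**4, base))
```

With integer arithmetic in `monomers_needed`, the result does not depend on floating-point rounding of logarithms.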
In our very rst test of the CuAAC click reaction on resin, an acceptable yield of 94% was achieved.However, our biggest concern was that accumulation of CuI and ascorbic acid over multiple cycles of organic solvents and reagents would damage the resin, possibly via Fenton-type chemistry.Because we needed to run several consecutive click reactions to have multiple different R groups on the oligourethane string, we anticipated that several exposures to CuI and ascorbic acid in conjunction with repeated swelling and shrinking of the resin throughout the steps could lead to loss of function.However, based on our results, the resins are robust enough to tolerate the repeated exposures, highlighting the power and utility of the CuAAC click chemistry. 46At the end of the synthesis, as we have previously published, the chromophore NBD was appended for analysis by LC-MS.With the reaction condition described above, we successfully synthesized 12 urethane-based oligomers (2 dimers, 2 trimers and 8 tetramers).The initial synthesis step (both coupling and click reaction) consistently proceeds smoothly.We attribute the high conversion to the absence of inorganic salt accumulation and the ready accessibility of the short chain on the resin.Generally, the conversions decrease as the number of steps increases while some truncated oligomers are observed.The conversions of the 12 oligomers ranged from 38% to 90%.Out of the 12 synthesized oligomers, 9 yielded more than 60% conversion, while only 3 urethane oligomers resulted in less than 50% conversion, which, unsurprisingly, were all tetramers.However, one of the tetramers (oligomer 2) yielded an 81% conversion, which is notably close to the conversions observed for dimers and trimers.This illustrates that the reactivity of different click reaction partners (alkynes) inuences the conversions, in addition to the number of steps.As this is a consecutive reaction without stepwise purication needed in the process, the stepwise 
conversions are not calculated.It's worth noting that only small amounts of materials (<1 mg) are required for sequencing step aer the target oligomers are made.Hence, we do not collect the entire sample from the HPLC, nor do we calculate a yield because the resin loading is oen variable and imprecise, just as with solid-phase peptide synthesis where yields are routinely not reported.
We rst used Unicode for traditional Chinese characters.Molecular-level encoding in hexadecimal required that each symbol be represented by appending a single alkyne, of sixteen, as a coupling partner on the azido side chain of a monomer along the oligomer backbone.Therefore, a library of sixteen different commercially available mass-separated terminal alkynes was identied (Fig. 6).Two chemical principles were used to guide library design.First, the masses of all the terminal alkynes differed by at least 2 atomic mass units to enable robust differentiation by LC-MS.Second, no reactive nucleophilic functional groups were present, thereby avoiding side-reactions during urethane coupling.During the building of this library, it was quite easy to identify 32, 64, and 128 commercially available alkynes that t our criteria, which speaks to the future possibilities for highly dense information storage using this approach to writing.
Table 2 shows the hexadecimal Unicode code points for the traditional Chinese characters of the proverb discussed above. The individual hexadecimal symbols were assigned in 1-to-1 fashion to sixteen different alkynes (Fig. 6). After assigning monomers to code points, we successfully synthesized the required eight oligomers (see ESI III(d)(1) †) via a combination of consecutive solid-phase CuAAC clicks and urethane coupling reactions, followed by prep-HPLC for purification. The O-terminus of each OU starts with the resin-preloaded alaninol (labeled with # in the sequence) or phenylalaninol (labeled with * in the sequence), which we have reported acts as a convenient indexing tool (Ala index or Phe index) to start reading of the mass spectra. 19

The eight oligomers were sequenced in a 2:1 MeOH/H2O mixture with K3PO4 at 70 °C and submitted to LC-MS analysis at specific intervals for a period of 4 h. As a representative example, Fig. 7 shows that chain-end degradation removes each monomer from the O-terminus, thus truncating the oligomers iteratively. Of the 32 expected masses, 27 were observed clearly and distinctly. The precursor 4-mers #8fd1, #7fd2, and #9060 overlapped with one of their truncated oligomers under the low-resolution LC-MS conditions due to their similar polarity. It is worth noting that the length of the truncated oligomers does not correlate with the polarity, resulting in disordered retention times for each moiety in an LC trace. However, one can easily identify which LC peaks grew and diminished in sequence over time. Using mass spectrometry, we could easily observe +1 and +2 charged moieties, facilitating identification of all the moieties by the intensity differences between oligomers and truncated oligomers and by the mass differences (see ESI III(c) †).
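The mass-difference reading described above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' analysis procedure: the residue masses are invented placeholders, the ladder is assumed to include a final end-group fragment, and charge states and retention times are ignored.

```python
def read_ladder(masses: list[float], residue_masses: dict[str, float], tol: float = 0.5) -> str:
    """Read symbols from successive mass losses, heaviest species first."""
    ordered = sorted(masses, reverse=True)
    symbols = []
    for heavier, lighter in zip(ordered, ordered[1:]):
        lost = heavier - lighter
        # Find the symbol whose (hypothetical) residue mass matches the loss.
        matches = [s for s, m in residue_masses.items() if abs(m - lost) <= tol]
        if len(matches) != 1:
            raise ValueError(f"mass loss {lost:.1f} does not identify a unique monomer")
        symbols.append(matches[0])
    return "".join(symbols)


# Placeholder residue masses for four hexadecimal symbols (NOT measured values).
residues = {"8": 300.0, "f": 314.0, "d": 310.0, "1": 286.0}
end_group = 250.0  # hypothetical mass remaining after all encoding monomers are lost
ladder = [1460.0, 1160.0, 846.0, 536.0, end_group]
print(read_ladder(ladder, residues))
```

Because the ladder is sorted by mass rather than by retention time, the disordered elution order noted in the text does not affect the readout in this sketch.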
Having thus decoded the stored Unicode code points, we recorded the hexadecimal codes in a Python list. We then fed this list to a short Python function, in a Jupyter notebook developed in house, which prints the characters corresponding to the Unicode code points, reconstructing the original text with no errors and without any foreknowledge of the proverb, as in our previous work. 19,20

With the success of our encoding of traditional Chinese characters with Unicode, we moved to encoding the same proverb in simplified Chinese characters with the ZM method. 47 Thus, the proverb was converted to a base-26 symbolic system simply by increasing our library to 26 terminal alkynes (Fig. 8). In addition, ZM only requires four letters as a maximum code length but permits shorter codes. This provides opportunities for employing single and short-string oligourethanes (e.g., 2-mers, 3-mers, or 4-mers) to encode a single character. When including the indexing monomer, this led to three 2-mers, two 3-mers and three 4-mers (Table 3), corresponding to the eight simplified Chinese characters (see ESI III(d)(2) †). The synthesis of the OUs was performed as for Unicode encoding, by iterative couplings, deprotections, solid-phase CuAAC clicks, and capping with NBD. Cleavage from the resin was performed with 1% trifluoroacetic acid (TFA) in dichloromethane (DCM) for 10 min. Purification by HPLC was performed before sequencing.
As with the Unicode oligomers, we sequenced these oligomers concurrently via chain-end degradation in a 2:1 MeOH/H2O mixture with K3PO4 at 70 °C in a heated shaker. These reactions were monitored by LC-MS every 60 min for 4 h. Of the 24 expected masses (from three 2-mers, two 3-mers and three 4-mers), 23 were observed clearly and distinctly in the 470 nm channel under the generalized low-resolution LC-MS conditions (Fig. 9). The precursor 4-mer #bdrw overlapped with one of its truncated oligomers. As discussed above, a lack of resolution between the precursor and a truncated oligomer does not cause any issues. We ran the decoding process with the in-house software to uncover the information stored within the oligourethanes. Specifically, the resulting ZM codes were passed to a Python list and fed to a specific function in the Jupyter notebook to render the appropriate Chinese characters, with additional heuristics described below to deal with ambiguous or non-unique ZM codes. Once again, the workflow returned the proper Chinese text with no errors and with no foreknowledge of the proverb.

Associated soware for the linguistic considerations
As mentioned above, to assist the encoding and decoding phases of the chemical procedures, we created simple routines (i.e., functions) in Python, available not only via a command-line script, but also via a Jupyter notebook to facilitate portability and transparency (see the Zhengmadication repository on GitHub: https://github.com/LingResCtr/zhengmadication). The system imports the RIME correspondences between Chinese characters and their ZM codes to create a database in memory. A function then reads in a text string containing the desired phrase, isolates the individual characters, and converts each of these to the corresponding ZM code in the database. These codes then determine the monomers, and their sequence, in the corresponding OUs over a 26-character base, thus storing the text chemically. Decoding follows a similar procedure. Readout from OU chain-end degradation produces a collection of alphabetic codes up to 4 letters long, and another Python routine takes these codes as inputs and outputs the corresponding Chinese characters from the ZM database. The accompanying programs also include similar routines for encoding and decoding Unicode, though these are vastly simpler because Python works natively with Unicode and already includes many helper functions to support such encoding and decoding.
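The table-lookup procedure just described can be sketched as follows. This is an illustration rather than the code in the authors' repository: the function names are hypothetical, the tab-separated input format is an assumption about how the correspondences might be stored, and the table holds only two character-code pairs mentioned in the text.

```python
def load_table(lines: list[str]) -> tuple[dict[str, str], dict[str, str]]:
    """Build character->code and code->character lookups from 'char<TAB>code' lines."""
    char_to_code, code_to_char = {}, {}
    for line in lines:
        char, code = line.strip().split("\t")
        char_to_code[char] = code
        code_to_char[code] = char
    return char_to_code, code_to_char


def encode(text: str, char_to_code: dict[str, str]) -> list[str]:
    """Convert each character of the text into its code."""
    return [char_to_code[ch] for ch in text]


def decode(codes: list[str], code_to_char: dict[str, str]) -> str:
    """Convert a list of codes read from the oligomers back into text."""
    return "".join(code_to_char[code.upper()] for code in codes)


# Toy table with one code per character (assumed format, tiny subset).
to_code, to_char = load_table(["近\tPDW", "远\tBDRW"])
codes = encode("近远", to_code)
print(codes, decode(codes, to_char))
```

Each code in the resulting list would correspond to one oligomer, with one encoding monomer per letter.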
A principal motivation of our foray into the ZM encoding was to open the door to applying visually based encoding systems to information storage in a chemical modality, irrespective of the use of oligourethanes. In this regard, ZM's occasional lack of uniqueness provides a novel challenge to chemical encoding. To overcome the obstacles posed and create a straightforward map from text to chemical storage and back, we explored the use of heuristics. Where a single code does not uniquely specify a single Chinese character, the redundancy can derive from the encoding of multiple-character phrases. We therefore only chose single-character correspondences, omitting multiple-character strings. And when a single character corresponds to more than one code, this often derives from "shortcuts": i.e., additional shorter codes to represent a character. We therefore restricted consideration to the longest code available for any given character: e.g., BRW, WBR, and BDRW can all represent 远 (yuǎn, be far), and so we choose the longest, BDRW. This remained practical because the oligourethane synthesis routine is so simple. Nevertheless, ZM also retains same-length ambiguities such as PDW and WPD for 近 (jìn, be near), and BDRW and WBRD for 远 (yuǎn, be far). Resolution of such cases required an additional heuristic, applying alphabetical order and choosing the first code: thus, we chose BDRW over WBRD for 远 (yuǎn, be far).
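The code-selection heuristics above (single characters only, longest code, then alphabetical order) can be expressed compactly. The sketch below uses the candidate codes quoted in the text; the function names and the input format are assumptions made for illustration.

```python
def choose_code(candidates: list[str]) -> str:
    """Pick the longest code; break ties by taking the alphabetically first."""
    return min(candidates, key=lambda code: (-len(code), code))


def build_encoding_table(entries: dict[str, list[str]]) -> dict[str, str]:
    """Keep single-character entries only and select one code for each."""
    return {
        entry: choose_code(codes)
        for entry, codes in entries.items()
        if len(entry) == 1  # omit multiple-character phrases
    }


# Candidate codes as quoted in the text for the two characters.
candidates = {
    "远": ["BRW", "WBR", "BDRW", "WBRD"],
    "近": ["PDW", "WPD"],
}
print(build_encoding_table(candidates))
```

Sorting on the pair (negative length, code) applies both heuristics in one pass, so every character maps to exactly one code.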
With these heuristics, we succeeded in closing the encoding loop: text is input and converted uniquely to ZM codes, which are then converted to unique OUs. Conversely, chain-end degradation yields a sequence of ZM codes, and these codes are then converted to unique Chinese characters to reproduce the original text (see Fig. 2). While our heuristics allowed correct identification of all characters, the lack of uniqueness introduces the possibility of incorrect identification of characters in the decoding process. But a distinct advantage of a visually based encoding scheme like ZM is that such errors will often visually approximate the target character: considering the code BDRW for 远 (yuǎn, be far), if we had misread the final W as D, we would have obtained BDRD for 元 (yuán, first); or misreading the final W as G gives BDRG for 顽 (wán, obstinate), containing the same central element present in 远. Thus, if the wrong character is selected, that character may share visual similarities with the target character, helping a competent reader to infer the correct intended character (though, as Table 3 shows with codes YI and YT, respectively 也 and 习, this similarity has limits). Finally, we note that the Python scripts automated the process of sifting through correspondence tables, matching Chinese characters with the corresponding Unicode or ZM codes; this step could be performed manually, avoiding the computer's binary representation altogether. Only the Unicode and ZM encodings are inherent to the procedure.
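The effect of a single misread letter can be explored by listing every valid code that differs from the intended one in exactly one position. The sketch below does this over a toy table containing only codes mentioned in the text; the function name is hypothetical.

```python
from string import ascii_uppercase


def single_misread_neighbors(code: str, code_to_char: dict[str, str]) -> dict[str, str]:
    """Return valid codes (and their characters) one substitution away from `code`."""
    neighbors = {}
    for position, original in enumerate(code):
        for letter in ascii_uppercase:
            if letter == original:
                continue
            candidate = code[:position] + letter + code[position + 1:]
            if candidate in code_to_char:
                neighbors[candidate] = code_to_char[candidate]
    return neighbors


# Toy table limited to codes quoted in the text.
toy_table = {"BDRW": "远", "BDRD": "元", "BDRG": "顽", "PDW": "近"}
print(single_misread_neighbors("BDRW", toy_table))
```

A decoder could present such neighbors to a reader as alternative candidates when a mass assignment is uncertain.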

Conclusions
Sequence-dened oligourethanes (OUs) were specically designed to encode Chinese characters.Exploring the affordances of OUs for the encoding/decoding of Chinese characters required a collaborative effort between chemists and linguists.This led us to explore two different encoding schemas, allowing us to establish the exibility and versatility of our group's use of OUs (or SDPs generally) for data storage using current industrystandard character encodings (Unicode), but also to explore their potential to store character encodings that preserve visual details of the data encoded in novel, and more application-specic, ways (ZM).Visually motivated encoding schemes like ZM fall short of Unicode in terms of uniqueness, but because every element of a ZM code is motivated by the visual character being encoded, each element provides information about the character's shape, potentially allowing greater exibility for information retrieval in situations of corrupted or incomplete encoding.The ZM method required an expansion to base-26, and therefore we developed an in situ synthetic method that generates the monomers on-the-y during the oligomer synthesis, and which could readily be expanded to much larger mathematical bases in the future.The information-encoded oligourethanes were generated, sequenced by liquid chromatography mass spectroscopy (LC-MS), and deciphered using our in-house developed soware, coupled with various heuristics to sort out ZM coding redundancies.The workow of soware to design the OU sequences for the Unicode and ZM methods, chemical synthesis and sequencing, and soware deciphering of the MS data with blind foreknowledge of encoded message, returned the proverb with no errors for each encoding method.Thus, we found the oligourethanes immensely adaptable to both encoding schemes.

Fig. 3 Example of encoding a Chinese character with the Zhèng Mǎ (ZM) method. The method decomposes the character into 二 (èr, code: BD) in red, 儿 (ér, code: RD) in blue, and 辶 (zhī, code: W or WA) in black. Using the code W or WA for 辶 and interpreting this as the bottom (i.e. last) element, this yields the code BD + R(D) + W(A) = BDRW for the entire character. But writing 辶 as W and interpreting this as the leftmost (i.e. first) element, the same components yield the code W + B(D) + RD = WBRD.

Fig. 6 Library of 16 terminal alkynes for click chemistry for Unicode.

Fig. 7 (a) The Unicode code point for the corresponding Chinese character, and the associated oligourethane. (b) The LC trace of sequencing oligourethanes with K3PO4; the reaction was heated at 70 °C in a microwave. (c) The corresponding exact masses of the oligomer and each truncated oligomer (see corresponding mass spectra in ESI †).

Fig. 8 Library of 26 terminal alkynes for click chemistry for the Zhèng Mǎ encoding.

Fig. 9 (a) The ZM code and corresponding Chinese character, and the associated oligourethane. (b) The LC trace of sequencing oligourethanes with K3PO4; the reaction was heated at 70 °C in a microwave. (c) The corresponding exact masses of the oligomer and each truncated oligomer (see corresponding mass spectra in ESI †).

Table 1 Chinese symbols to be encoded, traditional and simplified, pinyin (pronunciation), and English translations

Table 2 Unicode code points assigned for traditional Chinese characters in the proverb

Table 3 Zhèng Mǎ codes assigned for simplified Chinese characters in the proverb