Open Access Article
Victor Flors a, Raquel Cerveró a, Cristopher Tinajero b, Victor Sans b and Cristian Vicent *c
a Plant Immunity and Biochemistry Laboratory, Biochemistry and Molecular Biology Section, Department of Biology, Biochemistry and Natural Sciences, Universitat Jaume I, Castelló, Spain
b Institute of Advanced Materials (INAM), Universitat Jaume I, Av. Sos Baynat s/n, Castelló 12071, Spain
c Serveis Centrals d'Instrumentació Científica, Universitat Jaume I, Av. Sos Baynat s/n, 12071 Castelló, Spain. E-mail: barrera@uji.es
First published on 7th May 2025
Encoding abstract information in chemical mixtures uses the selective presence or absence of specific analytes, creating a binary-based framework for data storage. Data storage capacity (C in bits) can be maximized by encoding with large analyte libraries (M) at distinguishable concentration levels (L), where C = M·log₂ L. However, robust decoding of such complex libraries remains challenging for practical applications. This study introduces two hyphenated mass spectrometry (MS) methods, based on liquid chromatography (LC) and flow injection analysis (FIA), that meet the dual requirements of high analyte coverage and precise quantitation to maximize data storage capacity. Encoding and decoding use plant metabolite libraries to create specific mixtures. Using LC-MS, it is feasible to encode and decode up to 200 bits per mixture, with scalability reaching 10³–10⁴ bits at the cost of low decoding rates (ca. 0.5 bits per sec). FIA-MS offers a high-throughput alternative, handling 100 bits at faster rates (ca. 3 bits per sec). The data storage capacity can be expanded three-fold by incorporating up to eight quantitation levels, supporting binary, quaternary, or octal encoding schemes. To demonstrate the practical application of these methods, we encode and decode various digital file formats such as texts and multicolor images.
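The capacity relation quoted above can be checked numerically. The following is a minimal sketch, not part of the authors' code; the function name `capacity_bits` is a hypothetical choice, and the library sizes are simply the figures given in the abstract.

```python
import math


def capacity_bits(num_analytes: int, num_levels: int) -> float:
    """Storage capacity C = M * log2(L), in bits, for M analytes at L levels."""
    if num_analytes < 0 or num_levels < 1:
        raise ValueError("need M >= 0 and L >= 1")
    return num_analytes * math.log2(num_levels)


# Binary LC-MS case from the abstract: 200 analytes, present or absent.
print(capacity_bits(200, 2))

# Effect of the number of quantitation levels on a 100-analyte library.
for levels in (2, 4, 8):
    print(levels, capacity_bits(100, levels))
```

Going from two to eight levels multiplies the capacity of a fixed library by log₂ 8 = 3, which is the three-fold expansion mentioned in the abstract.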
Most studies utilize laser desorption ionization (LDI) methods coupled with mass spectrometry (MS) for abstract information encoding and decoding. Highly efficient data assembly, achieved by positioning chemical mixtures on pre-defined μm²-sized arrays on LDI plates, enables these methods to achieve high information storage densities.16,21,22 However, the limited analyte coverage (the number of identified metabolites) inherent in standalone mass spectrometry constrains the attainable storage capacity per mixture to a few tens of components. Storage capacity per mixture is critical for encoding large datasets in minimal physical space. The selection of analytes based on distinct molecular characteristics (such as NMR chemical shifts, fluorescence, or Raman shifts) also allows for straightforward encoding and decoding of information using analytical tools like gas chromatography (GC-FID),23 fluorescence,24 Raman spectroscopy25 or nuclear magnetic resonance (NMR).23,26,27 Table S1† provides an overview of analyte coverage achieved by these encoding and decoding methods. For example, GC-FID can encode around 20 compounds per mixture. Raman spectroscopy and fluorescence-based approaches are restricted to fewer than 10 analytes per mixture due to signal overlap. NMR, while useful, requires larger sample quantities than MS and presents challenges such as solubility issues and signal overlap (particularly for 1H detection), which could hinder the construction of large analyte sets.
To meet the increasing demands for storage capacity, high-throughput performance, and reliable data readout while maintaining operational simplicity, alternative methods for molecular data storage are being explored. Hyphenated mass spectrometry (MS) techniques represent a promising approach to expand analyte coverage in a single analysis. In particular, liquid chromatography coupled with mass spectrometry (LC-MS) can potentially detect thousands of metabolites in a single analysis.28–30 For example, chemical mixtures consisting of thousands of synthetic peptide-based molecules can be reliably identified by LC-MS.31 Following the selection of appropriate library compounds that meet criteria for ESI amenability and effective separation in both the m/z axis and retention time, the encoding of substantial amounts of information (Kb scale) within a single chemical mixture may become feasible. LC-based separation has the disadvantage of reducing sample throughput, necessitating a compromise between the throughput and the analyte coverage desired for a specific encoded message. Alternatively, flow injection analysis coupled with mass spectrometry (FIA-MS) offers a streamlined approach that involves transient sample injection into a continuous solvent carrier directly connected to the electrospray ionization (ESI) source of the mass spectrometer.32 FIA-MS provides lower analyte coverage (typically detecting hundreds of metabolites per run) but offers simplicity, rapid analysis times, and sensitivity and accuracy comparable to LC-MS.33,34 Despite the high diversity of commercially available LC-MS platforms and their widespread use in routine analysis, FIA-MS and LC-MS methods remain largely unexplored for data storage applications based on chemical mixtures. Herein, we demonstrate the potential of these methods for high-capacity molecular data storage by successfully encoding and retrieving various computer file formats, including textual data and multicolour images.
To achieve the highest information storage capacity, each mixture should contain as many kinds of compounds as possible, with reliable identification of these compounds being a key requirement. In the present study, a library of metabolites including flavonoids, plant hormones and dicarboxylic acids was utilized. The widespread use of electrospray ionization mass spectrometry (ESI-MS) as the analytical tool for identifying most flavonoid classes and plant hormones can be attributed to the presence of functional groups such as phenols and carboxylic acids.35 These groups readily undergo ionization under negative ESI-MS conditions, producing abundant, single [M − H]− ions (with minimal formation of other adducts), which makes them highly favourable for information encoding and decoding schemes. Once compiled, the metabolite library remained stable for several months (see ESI_1.2† Standard solutions section), obviating the need to regenerate such a large library for each new message encoding.
Ranking different procedures for storing information relies on the number and density of places where information can be stored and the amount of information that can be stored in each location or single mixture.16 The present methods offer potential for high-density storage of information per single mixture (200 bits per mixture) requiring minimal physical space. The encoded messages can be stored in the solid state by adding 5 μL (2.5 ppm) of the specific chemical mixture to a small portion of filter paper (10 mm²), thus attaining a storage density of 250 bytes per cm². The deposited mixture can be stored for several months, reconstituted by adding 200 μL of methanol, filtered and directly decoded by FIA-MS and LC-MS. The spatial organization of multiple mixtures enhances storage capacity linearly while reducing the theoretical storage density relative to single metabolite mixtures. As a result, greater physical space becomes essential to support the storage of such information. The density of information achieved using spatial ordering depends on the instrumentation vendor's design. In the present work, sample plates have 48 positions occupying ca. 70 cm², so the attainable spatial density is modest (19 bytes per cm²); this can be linearly enhanced by increasing the number of positions available per plate (well-plates of 394 positions occupying 70 cm² are also available, thus reaching storage densities of 155 bytes per cm²). This clearly compares unfavourably with LDI-MS methods, which can potentially deposit each metabolite mixture in spots falling in the μm² range, attaining KB per cm² densities.
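The filter-paper density quoted above follows from simple unit conversion. The sketch below is illustrative only; the helper name `density_bytes_per_cm2` is hypothetical, and only the single-mixture figure (200 bits on 10 mm²) is reproduced here.

```python
def density_bytes_per_cm2(bits: int, area_mm2: float) -> float:
    """Convert a payload in bits on an area in mm^2 into bytes per cm^2."""
    n_bytes = bits / 8                 # 8 bits per byte
    return n_bytes * 100.0 / area_mm2  # 1 cm^2 = 100 mm^2


# One 200-bit mixture deposited on 10 mm^2 of filter paper.
print(density_bytes_per_cm2(200, 10.0))
```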
Both the metabolite encoding and the specific positioning of the vials on the sample plate determine the reconstruction of the information in the correct order. Fig. 2 illustrates the binary code associated with “Universitat Jaume I” and the spatial ordering for decoding via FIA-MS. Once data were acquired, various commercially available data processing tools were employed. Chrotool, an application integrated within MassLynx 4.2, was particularly useful for rapid data visualization. It enabled the grouping of eight-metabolite subsets (see Fig. 2) according to the sequence defined during the encoding process. The extracted ion chromatograms (XICs) for each eight-metabolite group (byte) were then automatically displayed in a “one-click” operation, facilitating data visualization. As shown in Fig. 2, strong signals were observed for metabolites encoded in vial 1 at positions 2, 4, 6, 8, 10, 11, 13, 14, 15, 18, 19, 21, 24, 26, 27, 28, 30, and 31 within the 32-metabolite sublibrary. This arrangement encoded the following four bytes: 01010101 01101110 01101001 01110110.
Larger texts can be readily encoded and decoded using our approach. A paragraph of the 1908 Nobel lecture of J. J. Thomson (1112 bits) has also been encoded and decoded successfully (see CODE2_LCMS in ESI†). Due to the limited size of the encoded abstract information (approximately 1000 bits), conventional analyses of error rates and digital error correction are not directly applicable or informative for the present datasets. It is important to note that the peak area associated with the presence of a metabolite was two orders of magnitude higher than the noise level observed in the absence of that metabolite, which (i) eliminated the need for adjusting metabolite concentration ratios during encoding to achieve uniform intensities,18 (ii) ensured error-free decoding, (iii) allowed for reliable automation of the process and (iv) could be further exploited to use more diluted samples (and increase the density expressed in bytes per g) or to enlarge the subset of compounds per mixture.
We established an automated and general procedure for recovering the stored information, utilizing TargetLynx, an application integrated within MassLynx 4.2, which automates data processing and reporting. This tool incorporates robust confirmatory checks to identify analytes based on our previous empirical mass spectra library as well as user-defined intensity thresholds. The TargetLynx workflow used to recover the original message is shown in Fig. S1.† TargetLynx allows for the extraction of XICs for each metabolite, along with the calculation of their integrated areas, which can then be saved as a comprehensive summary in a .txt file. A Python script was developed in which the identity of each metabolite, as defined during encoding, and its integrated peak area were used to determine the presence or absence of each metabolite (the detailed scripts are appended in the ESI†). When a metabolite was present, the corresponding bit was assigned a value of “1” (and “0” if absent). The binary values were then concatenated, and the final binary code was converted into the original ASCII message.
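The presence/absence decoding step described above can be sketched as follows. This is a minimal illustration and not the authors' script (which is provided in the ESI†): the function names, the peak-area values and the threshold are all hypothetical, chosen so that "present" areas sit about two orders of magnitude above the noise. The bit pattern is the four-byte sequence reported for vial 1.

```python
def areas_to_bits(areas, threshold):
    """Assign '1' when the integrated peak area reaches the threshold, else '0'."""
    return "".join("1" if area >= threshold else "0" for area in areas)


def bits_to_ascii(bits):
    """Convert a bit string to ASCII text, ignoring any trailing partial byte."""
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8)).decode("ascii")


# Four bytes reported for vial 1, in library order (one metabolite per bit).
ENCODED = "01010101011011100110100101110110"

# Hypothetical integrated areas: present metabolites vs. baseline noise.
PRESENT_AREA = 1.0e5
NOISE_AREA = 5.0e2
areas = [PRESENT_AREA if bit == "1" else NOISE_AREA for bit in ENCODED]

THRESHOLD = 1.0e4  # hypothetical user-defined intensity threshold
bits = areas_to_bits(areas, THRESHOLD)
print(bits_to_ascii(bits))
```

In practice the list of areas would be read from the exported TargetLynx summary file, ordered according to the metabolite sequence defined during encoding.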
The presence of isoquercetin is evident in the extracted ion chromatogram (XIC) shown in Fig. 3(b) (bottom), where a minor XIC associated with the ion at m/z 301.0348 (see Fig. 3(b) top), formed by in-source fragmentation of isoquercetin, is also observed. Such in-source dissociation occurs to a small extent (less than 5% relative to the XIC of isoquercetin) and does not result in the detection of false positives (presence of quercetin when it is in fact absent) using our general processing workflow (see ESI†). However, as will be discussed below, such metabolite interference constitutes a major hurdle for quantitative analysis. (ii) Metabolites with the same nominal mass: due to the absence of chromatographic separation, FIA is inherently less selective than LC-MS techniques. In particular, when using low-resolution MS instruments (see Experimental section), decoding messages encoded with metabolites of the same nominal mass becomes problematic, and it is critical to select an encoding metabolite library that avoids isobaric analytes. Greater flexibility in choosing metabolites can be achieved by coupling FIA with high-resolution MS (FIA-HRMS), which allows for more precise metabolite identification through accurate m/z determination. This advantage is exemplified in the successful encoding/decoding of the text “Universitat Jaume I”, where alpinetin (C16H14O4; [M − H]− m/z 269.0814) and genistein (C15H10O5; [M − H]− m/z 269.0450) were used from the 32-metabolite subset. (iii) Interference between regioisomers: identifying regioisomers using FIA-MS poses a significant challenge due to the lack of separation before MS analysis. In the case of flavonoids, regioisomers often share identical fragmentation patterns and are difficult to differentiate based solely on the relative branching ratios of their product ions under collision-induced dissociation (CID) conditions.
To overcome this challenge, ion mobility spectrometry mass spectrometry (IMS-MS) can provide a solution, as it allows for the distinction of regioisomers based on the shape and size of their associated gas-phase ions. The regioisomers genistein, baicalein, and galangin (C15H10O5) were investigated using IMS-MS. Fig. 4(a) displays their negative ESI mass spectra, while Fig. 4(b) shows the corresponding mobility traces for their [M − H]− ion at m/z 269.0501. The ability to differentiate these three regioisomers based on their distinct drift times demonstrates the feasibility of utilizing regioisomers as encoding analytes, provided their identification relies on both m/z and drift time measurements.
The increase in storage capacity achieved through quantitation follows a logarithmic (log₂) function, meaning that analyte determination at four or eight levels can result in a two- or three-fold increase in capacity, respectively. This enables the use of higher-order quaternary or octal encoding systems. The quaternary (base-4) numeral system, which employs the digits 0, 1, 2, and 3, can represent any number, with each digit associated with one of four concentration levels. Interestingly, the concept of quaternary encoding was inspired by the genetic code of DNA, where digital information can be encoded into sequences of A, T, C, and G. A binary number can be converted into quaternary form, where each binary pair (00, 01, 10, 11) is converted to 0, 1, 2, and 3, respectively. This approach significantly reduces the implementation effort (halving both the encoding and decoding steps) and the overall latency of the FIA-MS and LC-MS methods. Similarly, the octal (base-8) numeral system, using digits 0 to 7, operates by converting each binary digit triplet (000, 001, 010, 011, 100, 101, 110, 111) into the corresponding octal value. The use of both the quaternary and octal encoding systems also has implications for high-level information encryption. For example, with eight possible concentration levels for each component, half a dozen metabolites can generate approximately 260 000 combinations (8⁶ = 262 144), each representing a different symbol. This creates the opportunity for developing a degenerate encoding key, where a given symbol is represented by multiple combinations. Such degenerate keys enhance the reliability, flexibility, and security of systems requiring robustness, such as in digital communications and cryptographic applications.41
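The pair and triplet conversions described above can be illustrated with a short sketch. The function name `regroup_bits` is hypothetical, and padding the final group with trailing zeros is an assumption made here for illustration; the source does not state how incomplete groups are handled.

```python
def regroup_bits(bits: str, bits_per_digit: int) -> list[int]:
    """Split a bit string into groups of 2 (quaternary) or 3 (octal) bits.

    Each group becomes one digit, i.e. one concentration level.
    Assumption: an incomplete final group is padded with zeros on the right.
    """
    padding = (-len(bits)) % bits_per_digit
    padded = bits + "0" * padding
    return [int(padded[i:i + bits_per_digit], 2)
            for i in range(0, len(padded), bits_per_digit)]


byte_u = "01010101"                 # ASCII "U"
print(regroup_bits(byte_u, 2))      # four quaternary digits
print(regroup_bits(byte_u, 3))      # three octal digits (one padding bit)

# Number of distinct symbols from six metabolites at eight levels each.
print(8 ** 6)
```

The eight bits of one character thus need eight analytes in binary, four in quaternary and three in octal, which is the reduction in encoding steps discussed above.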
In our experiments, we encoded texts and multicolour images using the identification and quantitation of chemical mixtures at different concentration levels. The logo of our institution, represented as a 25 × 10 grid, was encoded in quaternary code using 10 sets of 25 metabolites (see CODE3_QUAT_FIAMS and CODE4_QUAT_LCMS in ESI†). Four concentration ranges of metabolites were employed to define the 0–3 values in the quaternary encoding system. For each of the 10 vials, metabolites intended for inclusion were manually transferred from their stock solutions. Calibration mixtures were prepared from a single standard mixture of the 25 flavonoids at 1000 ppb by subsequent dilution covering the desired concentration ranges. Information retrieval using both FIA-MS and LC-MS was achieved with 100% efficiency, and the decoded image is shown in Fig. 5(a). To recover the original message, a Python script was utilized, where four threshold values were established to assign the 0–3 values. These values were then concatenated, and the resulting quaternary code was converted back into the original image.
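The level-assignment step for quaternary decoding can be sketched as below. This is not the authors' script: the cut-off concentrations in `CUT_POINTS_PPB` are hypothetical values chosen only to separate four ranges, the function names are illustrative, and treating each vial as one 25-pixel row of the image is an assumption.

```python
from bisect import bisect_right

# Hypothetical boundaries (ppb) separating the four concentration ranges.
CUT_POINTS_PPB = (50.0, 200.0, 600.0)


def concentration_to_digit(conc_ppb: float) -> int:
    """Map a quantified concentration to a quaternary digit 0-3."""
    return bisect_right(CUT_POINTS_PPB, conc_ppb)


def decode_vial(concentrations) -> list[int]:
    """Decode one vial: one quaternary digit per metabolite, in library order."""
    return [concentration_to_digit(c) for c in concentrations]


def digits_to_bits(digits) -> str:
    """Expand each quaternary digit back into its two-bit pair."""
    return "".join(format(d, "02b") for d in digits)


# Hypothetical quantitation results for the first five metabolites of a vial.
row = decode_vial([10.0, 120.0, 400.0, 900.0, 0.0])
print(row)
print(digits_to_bits(row))
```

For a four-colour image each digit can be used directly as a colour index, so the ten decoded vials would stack into a 10-row, 25-column array of values between 0 and 3.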
The word “flavonoids” was converted into 80-bit data and stored in a single vial using a subset of 80 metabolites. Decoding via FIA-MS was conducted with 100% readout accuracy (see CODE4_FIAMS in ESI†). To demonstrate the effectiveness of this shortening, consider the binary encoding vector associated with the word “flavonoids”. As illustrated in Fig. 5(b), each 24-bit segment (representing 24 different analytes) can be represented by a sequence of just eight metabolites by using eight distinct concentration levels, thus achieving the same data storage capacity. Accordingly, a single vial containing 27 metabolites at eight concentration levels was used for octal encoding and satisfactory LC-MS decoding, replacing the need for 80 metabolites (see CODE5_OCTAL_LCMS in ESI†). To recover the original message, a Python script adapted from the binary data retrieval process (see ESI†) was used. Specifically, eight threshold values were defined to assign values between 0 and 7, which were then concatenated. The octal code was subsequently converted into binary and finally the original text was retrieved.
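The reduction from 80 to 27 metabolites can be verified with a round-trip sketch. The function names are hypothetical, and padding the 80-bit message with one trailing zero to reach a multiple of three is an assumption; the authors' own conversion is given in the ESI†.

```python
def text_to_octal(text: str) -> list[int]:
    """ASCII text -> bit string -> list of octal digits (levels 0-7).

    Assumption: the bit string is right-padded with zeros to a multiple of 3.
    """
    bits = "".join(format(byte, "08b") for byte in text.encode("ascii"))
    bits += "0" * ((-len(bits)) % 3)
    return [int(bits[i:i + 3], 2) for i in range(0, len(bits), 3)]


def octal_to_text(digits) -> str:
    """Octal digits -> bit string -> ASCII text, dropping the padding bits."""
    bits = "".join(format(d, "03b") for d in digits)
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8)).decode("ascii")


levels = text_to_octal("flavonoids")
print(len(levels))            # number of metabolites needed at eight levels
print(octal_to_text(levels))  # recovered message
```

Ten ASCII characters give 80 bits, and 80/3 rounded up is 27, in agreement with the 27-metabolite vial described above.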
The FIA-MS approach achieves maximum storage capacities of 80–100 bits per mixture. Due to the lack of chromatographic separation, metabolite identification in FIA-MS relies solely on m/z values, which narrows the scope of the metabolite list that can be included, especially when utilizing low-resolution FIA-MS. A wider encoding analyte scope can be achieved using high-resolution (HR) MS and ion mobility spectrometry (IMS) MS capabilities. HR-MS enables the discrimination of isobaric compounds, particularly within metabolomic compound classes, whereas IMS-MS introduces a rapid, millisecond-scale, post-ionization separation dimension, enhancing the ability to distinguish regioisomers through a combination of m/z and drift time. LC-MS methods, on the other hand, combine the broadest metabolite coverage in comparison to other techniques with wide flexibility in choosing the encoding library metabolites, as each metabolite is uniquely defined by both m/z value and retention time. The maximum storage capacity for LC-MS in routine metabolomic analysis, demonstrated here at 200 bits per mixture, has the potential to reach 10⁴ bits per mixture, representing the highest data storage capacity achieved in molecular data systems based on single chemical mixtures.
The data storage capacity per mixture, as well as the encoding and decoding latency, could be significantly improved by integrating additional libraries, adopting faster robotic-based encoding methods, or leveraging advanced metabolomic data-processing tools. Although the present demonstrations focused on libraries of plant hormones and flavonoids, the approach is adaptable to other metabolite families, provided they are amenable to the electrospray ionization (ESI) process. This compatibility extends the method's applicability to a wide range of metabolites, including small molecules that may pose analytical challenges for laser desorption ionization (LDI) techniques due to matrix interference in low m/z ranges. In this context, lipids, which constitute approximately three-quarters of the human metabolome database (HMDB 4.0, https://www.hmdb.ca), encompassing around 5 × 10⁴ lipid species, offer substantial potential for constructing large-capacity devices based on chemical mixtures. Furthermore, automated synthesis of expansive synthetic libraries via multicomponent reactions, coupled with robotic liquid dispensing into vials, can increase encoding efficiency. As shown in this work, metabolomics provides an optimized and user-friendly analytical environment, largely due to its extensive history of software development, allowing for rapid data processing with minimal adjustments. Additionally, other widely used hyphenated techniques in metabolomics, such as gas chromatography-MS and capillary electrophoresis-MS, could also be adapted for high-capacity chemical mixture-based memory devices. Integrating foundational principles from metabolomics into synthetic chemical mixtures can offer novel avenues at the intersection of information technology, plant physiology, analytical chemistry, and chemical sciences.
Footnote
† Electronic supplementary information (ESI) available: Detailed experimental procedures and instrumentation, as well as codes, qualitative and quantitative output results and Python scripts for message retrieval, are given as supplementary material in CODES_SI.zip; .raw data are deposited in https://doi.org/10.5281/zenodo.15101114; the complete scripts for identification and quantitation using binary, quaternary and octal schemes are available at https://github.com/catm542-ai/ChemDataProcessor. See DOI: https://doi.org/10.1039/d5an00353a
This journal is © The Royal Society of Chemistry 2025