Open Access Article
Petra Sőregi,ab Márton Zwillinger,a Lajos Vágó,c Márton Csékei*a and Andras Kotschy*a
a Servier Research Institute of Medicinal Chemistry, Záhony utca 7, 1031 Budapest, Hungary. E-mail: andras.kotschy@servier.com
b Hevesy György PhD School of Chemistry, Eötvös Loránd University, Pázmány Péter sétány 1/A, 1117 Budapest, Hungary
c Kastély u. 49/A, 2045 Törökbálint, Hungary
First published on 22nd August 2024
The need for reliable information storage is on a steep rise. Sequence-defined polymers, particularly oligonucleotides, are already in use in several areas, while compound mixtures also offer a simple way for storing information. We investigated the use of a set of isotopologues in information storage by mixing, where the information is stored in the form of a mass spectrometric (MS) fingerprint of the mixture. A small molecule with 24 non-labile and replaceable hydrogen atoms was selected as a model, and a set of components covering the D0–D24 deuteration range were synthesized. Theoretical analysis predicted that by mixing up to 10 out of the prepared components, one can encode over 130 million different combinations and distinguish their MS fingerprints. As a proof of principle, several mixtures predicted to have similar fingerprints were prepared and their MS fingerprints were recorded. From each measured MS fingerprint, we were able to unambiguously identify the actual composition of the mixture. It was also demonstrated that one can make the MS fingerprints of a given mixture unique, thereby making counterfeiting of the stored information very difficult. Finally, the utility of isotope ratio encoding in covalent tagging was also demonstrated.
Another emerging approach uses compound mixtures for chemical information storage.15 The proper selection of the components, usually all having distinct molecular characteristics (e.g. MW16 or NMR chemical shift17), allows for the simple and reliable writing and reading of the code. Although this approach is simpler to execute, its information storage capacity is lower than that of sequence-defined polymers. A specific kind of information storage through mixing uses isotopologues of the same molecule as components.18 Varying the isotope composition without changing the molecular formula or structure alters the MS signal of the component but does not impact its other characteristics, which makes it appealing for biological applications. It was also reported recently that isotope ratio encoding can be combined with sequence-defined coding.19,20 The unique identifiers in this latter method were the MS fingerprints of covalently linked D0–D4 isotopologue mixtures that could be recorded with standard high-resolution spectrometers. Another recently published approach uses 13C/12C isotope ratios to encode information that can be read by NMR spectroscopy.21
The sensitivity and reproducibility of the MS measurements suggest that the information storage capacity of isotope ratio encoding might go well beyond that of the other mixing techniques. Here we show that mixing up to ten components, obtained by the multi-deuteration (from 0 to 24 deuterium atoms) of the same molecule, allows for the generation of over 130 million different isotopologue combinations whose MS fingerprints are unambiguously distinguishable. We also highlight how varying the extent of deuteration of the individual components gives rise to unique MS fingerprints, making the counterfeiting of the stored information exceedingly difficult.
In practical applications of isotope ratio encoding, one stores information in mixtures of components. The MS trace of the mixture, which we will call an MS fingerprint from now on, is the combination of the MS traces of the individual components, where their respective intensity is proportional to their fraction in the mixture. Fig. 1B (bottom left) shows how the MS fingerprint of a 1:1 mixture of a D0 and a D5 component evolves from the MS traces of the ingredients. The essence of using isotope ratio encoding is the accurate and reproducible recording of MS fingerprints and the quantitative assessment of their similarity. To this end, we chose the “Normalized Dot Product” function (NDP) that is widely used in proteomics search algorithms,22–24 as well as for the comparison of the mass spectra of small molecules.25,26 An NDP score can take a value between 0 and 1, and the closer the NDP value is to 1, the higher the similarity of the MS fingerprints is. To assess the limits of information storage by isotope ratio encoding we selected a molecule (Fig. 1A) that enabled the expansion of the deuteration window to 0–24 deuterium atoms per molecule. Our action plan was relatively straightforward: (1) synthesis of the D0–D24 components, ideally with a high deuterium incorporation efficiency, (2) exact quantification of the isotopologue composition of components by HRMS, (3) in parallel, theoretical investigation of the information storage capacity by isotope ratio encoding with our isotopologue compound collection, and assessment of how the quality of the prepared components impacts this capacity, (4) selection and application of the mixing rules to prepare mixtures and subsequent analysis of the mixtures by HRMS, (5) proof of principle experiments testing MS pattern recognition and code reading from the prepared mixtures, and (6) looking into the limitations that might arise if high density isotope ratio encoding is used for tagging.
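The two operations described above, building a mixture fingerprint from component traces and scoring similarity, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the NDP is taken in its common cosine form (dot product divided by the product of the vector norms), which the text does not spell out, and the D0 and D5 traces below are made-up three-peak patterns on a unit-mass grid, not measured data. The function names `mix_fingerprints` and `ndp` are hypothetical.

```python
import math

def mix_fingerprints(traces, fractions):
    """Weighted sum of component MS traces (each trace normalised to unit area)."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    length = max(len(t) for t in traces)
    mixed = [0.0] * length
    for trace, frac in zip(traces, fractions):
        area = sum(trace)
        for i, intensity in enumerate(trace):
            mixed[i] += frac * intensity / area
    return mixed

def ndp(a, b):
    """Normalized dot product, assumed here to be the cosine form."""
    n = max(len(a), len(b))
    a = list(a) + [0.0] * (n - len(a))  # pad to a common mass grid
    b = list(b) + [0.0] * (n - len(b))
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

# Hypothetical traces: index = mass offset from the D0 monoisotopic peak.
d0 = [0.80, 0.17, 0.03]
d5 = [0.0] * 5 + [0.80, 0.17, 0.03]

mix_1_to_1 = mix_fingerprints([d0, d5], [0.5, 0.5])
print(round(ndp(mix_1_to_1, d0), 4))
```

With these toy traces the D0 and D5 patterns do not overlap, so their NDP is 0, while the 1:1 mixture scores 1/√2 against either pure component.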
First, we wanted to understand the information storage potential of the theoretical D0–D24 isotopologue compound collection of 1. To this end, we defined the mixing rules, then created a virtual set of mixtures. We calculated the theoretical isotope fingerprint collection for the given virtual set and calculated the similarity of each pair of isotope fingerprints within the same virtual set. The theoretical isotope fingerprints were generated as a linear combination of the isotope traces of the compounds in the mixture. The isotope trace for any atomic composition was obtained from the EnviPat service.27 To quantify the similarity of any pair of fingerprints, the NDP function was used, as described above. The smaller the highest NDP value within a virtual set is, the less alike the fingerprints are in a code. Our earlier studies have shown that laboratory MS equipment can reliably distinguish two fingerprints with a similarity below NDP = 0.9990, therefore this value was set as a benchmark in the modelling. The number of mixtures we can prepare from a given set of isotopologue compounds depends on two factors: the maximum number of components we use within a certain mixture and the proportion size we use for mixing (i.e., 10% or 25% increments). In our system, having 25 isotopologue compounds (D0–D24), when using two components per mixture with the fraction of each component being 25% or a multiple of it (i.e., no component at 0%), we arrive at 900 possible mixtures and the most alike MS fingerprints have an NDP of 0.9573 (Table 1). Decreasing the minimal ratio of a component to 10%, the number of possible mixtures increases to 2700 and the NDP of the most similar fingerprints changes to 0.9945. If we increase the number of isotopologues in a mixture to three, then using the same increment sizes as before, we arrive at 6900 and 82 800 mixtures, respectively (Table 1). It is interesting to note that the larger number of components also results in a decreased similarity, with NDP values of 0.8994 for the 25% and 0.9912 for the 10% mixing. When we use four components from the D0–D24 compounds in a mixture with a 25% increment, the number of mixtures increases to 12 650 and the similarity of the most alike pair drops to an NDP of 0.8538. When using the same set with a 10% increment, the number of mixtures exceeds 1 million and the NDP is 0.9887. To exploit the maximal potential of the 10% mixing increment we can allow up to 10 different components in any mixture. In this case, the number of possible mixtures increases significantly to 131 128 140 without compromising the quality of the coding. It is not surprising that the most alike mixtures in this code are the binary ones already identified above (Table 1, entry 4) and the NDP value is therefore the same 0.9945. Of course, setting the increment at 10% and allowing only multiple-fold increments are arbitrary restrictions, but the vast number of distinguishable mixtures we can generate this way demonstrates the power of this information encoding principle well.
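The mixture counts quoted above can be reproduced with a short combinatorial sketch. It assumes that each mixture is defined by choosing k of the 25 compounds and assigning every chosen compound a non-zero multiple of the increment so that the fractions sum to 100%; the function names are hypothetical and the counting rule is our reading of the mixing rules, not code from the original work.

```python
from math import comb

def count_mixtures(n_compounds, n_components, steps):
    """Mixtures with exactly n_components, each at a positive multiple of 1/steps.

    comb(n_compounds, n_components): which compounds are used.
    comb(steps - 1, n_components - 1): ways to split `steps` increments
    among them with at least one increment each (stars and bars).
    """
    return comb(n_compounds, n_components) * comb(steps - 1, n_components - 1)

def count_up_to(n_compounds, max_components, steps):
    """Mixtures containing between 1 and max_components components."""
    return sum(count_mixtures(n_compounds, k, steps)
               for k in range(1, max_components + 1))

for k in (2, 3, 4):
    # steps=4 corresponds to 25% increments, steps=10 to 10% increments
    print(k, count_mixtures(25, k, 4), count_mixtures(25, k, 10))
print(count_up_to(25, 10, 10))
```

Under this counting rule the single-component through quaternary mixtures at 10% increments add up to 1 148 125, and allowing up to ten components gives 131 128 140.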
| No. of components | Composition unit | No. of mixtures | Highest NDP using D0–D24 MS fingerprints | Highest NDP using 1a–y MS fingerprints |
|---|---|---|---|---|
| 2 | 25% | 900 | 0.9573 | 0.9857 |
| 3 | 25% | 6900 | 0.8994 | 0.9813 |
| 4 | 25% | 12 650 | 0.8538 | 0.9739 |
| 2 | 10% | 2700 | 0.9945 | 0.9984 |
| 3 | 10% | 82 800 | 0.9912 | 0.9981 |
| 4 | 10% | 1 062 600 | 0.9887 | 0.9976 |
| 1–10 | 10% | 131 128 140 | 0.9945 | 0.9984 |
| 1–10a | 10% | 131 128 140 | 0.9992 | nd |

a Calculated for C182H224N20O32.
In applications where isotope ratio encoding is used as a tagging method, the inherent isotope distribution of the tagged molecules can interfere with the coding. To simulate such a situation, we increased the formula of our D0–D24 isotopologues from the original monomer's (C20H26N2O5) to a decamer's (C182H224N20O32). Indeed, when we create mixtures of up to 10 components with an increment of 10% using these decamers with increased atomic numbers, the most similar mixtures have an NDP of 0.9992, which might be difficult to distinguish reliably. On the other hand, when increasing the increment size in the mixing to 20%, which still enables us to code 118 755 molecules, the NDP fell to 0.9966, which is well within our applicability range. Of course, this theoretical scenario implies that we want to distinguish over 100 thousand molecules of the same atomic composition, which is more stringent than real applications. As soon as the molecules we want to distinguish by tagging cover a more diverse set of atomic compositions, we can proportionally increase the number of the entities we can code by tagging.
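The figure of 118 755 for the 20% increment can be checked by hand. The worked sum below assumes that every component in a mixture is present at a non-zero multiple of 20%, so a mixture holds at most five of the 25 compounds; for k components there are \binom{25}{k} ways to pick them and \binom{4}{k-1} ways to distribute the five 20% units so that each component receives at least one.

```latex
N_{20\%} \;=\; \sum_{k=1}^{5} \binom{25}{k}\binom{4}{k-1}
        \;=\; 25 + 1200 + 13\,800 + 50\,600 + 53\,130
        \;=\; 118\,755 \;=\; \binom{29}{5}
```

The last equality is the Vandermonde identity; the same argument applied to the 10% increment gives \binom{34}{10} = 131 128 140.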
Since the combined effect of natural isotope abundance and enrichment of specific isotopologues leads to complex MS patterns, a problem identified and treated in metabolomic studies,28,29 the exact isotopologue distribution of 1a–y was deconvoluted from their mass spectra (for details see ESI and Table S2†). As expected, achieving 100% incorporation efficiency in every reaction and position was not feasible, and this influenced the composition of the products, demonstrating the necessity of rerunning the theoretical calculations once the components are synthesized and qualified. Initially, our goal was to understand how the actual isotopologue composition affects the effectiveness of coding. Subsequently, we aimed to assess the performance of the encoding method by preparing specific component mixtures from 1a–y, recording their experimental MS spectra, and testing whether we could accurately identify the mixtures' composition by comparing the recorded MS fingerprints with the collection of calculated fingerprints.
800 ternary mixtures' and 1 062 600 quaternary mixtures' MS fingerprints (1 148 125 in total), arising from mixing 1a–y, 22 examples were found, where the similarity to the closest analogue was greater than 0.9980, while 5174 mixtures gave an NDP between 0.9980 and 0.9970. Interestingly, while for most mixtures the highest similarity relationship was reciprocal (i.e., mixture Mx was the most similar to mixture My, and vice versa), this was not a general rule. We selected and prepared all 22 mixtures where the predicted similarity with the most alike mixture was above 0.9980 (binary and ternary mixtures) and several other ternary and quaternary mixtures with a predicted similarity above 0.9970 (34 in total, see Table S3† for details). For simplicity, here we present 4 of the binary mixtures (M1–M4) and 5 of the ternary mixtures (M5–M9), where the predicted similarity was amongst the highest and 5 of the quaternary mixtures (M10–M14). The composition of the mixtures is shown in Table 3. This table also lists the composition and similarity of each mixture to its closest analogue based on their calculated MS spectra.
| Mixture | Composition | Closest analogue and NDP based on calculated MS fingerprints | NDP of calculated and measured MS fingerprints | Closest analogue and NDP based on measured MS fingerprints |
|---|---|---|---|---|
| M1 | 1i (0.9) – 1n (0.1) | M2 0.9982 | M1 1.0000 | M2 0.9983 |
| M2 | 1i (0.9) – 1m (0.1) | M1 0.9982 | M2 1.0000 | M1 0.9982 |
| M3 | 1a (0.9) – 1n (0.1) | M4 0.9983 | M3 0.9997 | M4 0.9984 |
| M4 | 1a (0.9) – 1m (0.1) | M3 0.9983 | M4 0.9998 | M3 0.9985 |
| M5 | 1g (0.8) – 1h (0.1) – 1n (0.1) | M6 0.9981 | M5 0.9998 | M6 0.9983 |
| M6 | 1g (0.8) – 1h (0.1) – 1m (0.1) | M5 0.9981 | M6 0.9995 | M5 0.9979 |
| M7 | 1a (0.8) – 1b (0.1) – 1n (0.1) | M8 0.9981 | M7 0.9987 | M8 0.9980 |
| M8 | 1a (0.8) – 1b (0.1) – 1m (0.1) | M7 0.9981 | M8 0.9991 | M7 0.9976 |
| M9 | 1d (0.1) – 1l (0.5) – 1m (0.4) | M10 0.9975 | M9 0.9998 | M10 0.9975 |
| M10 | 1d (0.1) – 1l (0.5) – 1m (0.3) – 1n (0.1) | M9 0.9975 | M10 0.9996 | M9 0.9968 |
| M11 | 1a (0.1) – 1g (0.7) – 1l (0.1) – 1n (0.1) | M12 0.9974 | M11 0.9998 | M12 0.9971 |
| M12 | 1a (0.1) – 1g (0.7) – 1l (0.1) – 1m (0.1) | M11 0.9974 | M12 0.9999 | M11 0.9976 |
| M13 | 1a (0.6) – 1b (0.2) – 1n (0.1) – 1u (0.1) | M14 0.9971 | M13 0.9986 | M14 0.9961 |
| M14 | 1a (0.6) – 1b (0.2) – 1m (0.1) – 1u (0.1) | M13 0.9971 | M14 0.9978 | 1a (0.7) – 1b (0.1) – 1m (0.1) – 1u (0.1) 0.9958 |
| M35 | 1a (0.1) – 1b (0.1) – 1e (0.1) – 1i (0.1) – 1l (0.1) – 1n (0.1) – 1r (0.1) – 1s (0.1) – 1v (0.1) – 1y (0.1) | Nd | M35 0.9923 | |
To establish the optimal conditions for code reading by MS, we prepared a dilution series of 1a in acetonitrile, determined the linearity range of the detector, and adjusted the injected quantities to fit within that range. Mixtures were prepared by the volumetric dispensation of stock solutions in propionitrile. The MS fingerprints were recorded following dilution to the desired concentration range.
First, the measured and theoretical MS fingerprints of M1–M34 were compared (Table 3) to assess the impact of the experimental error from mixing on the coding. It was reassuring to see that most of the NDPs exceeded 0.9990, and the lowest one (M14) was still 0.9978. Next, the measured MS fingerprints of M1–M34 were compared to the database of 1 148 125 calculated MS fingerprints, searching for the most similar ones for each of them. For reliable information encoding the measured fingerprint must unambiguously match its calculated equivalent. Our results showed that for all the prepared and experimentally characterized mixtures (M1–M34) the composition could be identified with high certainty because the most similar calculated fingerprints always matched the experimental compositions. This result shows that isotope ratio encoding is an efficient tool to write and read information. An interesting indicator of the reliability of information reading from an isotope mixture is the separation between the most similar and the second most similar calculated MS fingerprints when compared to the measured one. In the M1–M34 set, the NDP values for the second most similar fingerprints were in the range of 0.9951–0.9985, all showing a considerable separation from the first hit (Table 3). It is interesting to note that 11 out of the 14 predicted closest analogues (mixtures with the second most similar calculated MS fingerprint) were identified with both the calculated and the measured MS fingerprint of the primary mixture. For the four mixtures M14, M31, M33 and M34, the closest analogues identified by using their measured or calculated MS fingerprints were very similar, differing in only a single component present in 10% (for details see Table S3†). The high similarity between the calculated and measured MS fingerprints for M1–M34 suggests that reliable information storage with these mixtures is feasible. On the other hand, these mixtures contained at most 4 components, which is well below the potential of the method. To evaluate the robustness of mixing and the potential impact of variations in the volumetric dispensing, the 10-component mixture M35 was also prepared. Its MS fingerprint was recorded and compared with the calculated fingerprint based on its composition (see ESI† for details).
The obtained similarity was 0.9923, which is somewhat lower than for the mixtures with fewer components. On the other hand, it was already shown (Table 1) that the more components the mixture has the less similar its closest analogue will be. Comparison of the measured MS fingerprint of M35 with the calculated MS fingerprints of potential close analogues enabled the unambiguous identification of its composition, reaffirming the power of this information encoding method.
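The code-reading step described above, ranking all calculated fingerprints by their NDP against a measured one and checking the gap between the first and second hit, can be sketched as follows. The sketch assumes the cosine form of the NDP and fingerprints binned on a common mass grid; the library entries and the "measured" vector are made-up toy values, and `read_code` is a hypothetical helper, not the procedure used to search the 1 148 125-entry database.

```python
import math

def ndp(a, b):
    """Normalized dot product (cosine form assumed) of equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must share the same mass grid")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

def read_code(measured, library):
    """Rank library fingerprints by NDP; return the best and second-best hits."""
    ranked = sorted(((ndp(measured, fp), label) for label, fp in library.items()),
                    reverse=True)
    (best_score, best), (second_score, second) = ranked[0], ranked[1]
    return best, best_score, second, second_score

# Toy calculated fingerprints (hypothetical intensities on a 4-point mass grid).
library = {
    "A": [0.9, 0.1, 0.0, 0.0],
    "B": [0.9, 0.0, 0.1, 0.0],
    "C": [0.5, 0.0, 0.0, 0.5],
}
measured = [0.89, 0.11, 0.005, 0.0]  # noisy version of "A"

best, best_score, second, second_score = read_code(measured, library)
print(best, round(best_score, 4), second, round(second_score, 4))
```

The difference between `best_score` and `second_score` plays the role of the separation discussed in the text: the larger it is relative to the measurement error, the more confident the assignment.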
Fig. 2 Similarity of the 1d(0.25)-1i(0.25)-1n(0.25)-1t(0.25) mixture (M36A) and its 1d* and 1i* containing analogues (M36B–D).
Scheme 3 Test reaction used to assess the utility of the information storage as a covalent tag (deuteriums not shown).
This transformation was selected because amide coupling is by far the most frequently used chemical route for covalent tagging due to its robustness. Based on the MS fingerprint of the isotopologue mixture M37 and of the amine, the theoretical MS fingerprint was calculated for the product M38 and compared with the measured MS fingerprint (see ESI† for details). We were delighted to find that the measured and the calculated MS fingerprints of the tagged molecule showed a remarkably high similarity (NDP = 0.9993), well within the range we observed previously (cf. Table 3), therefore showing that the amide coupling did not impair the isotope ratio code.
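One plausible way to combine the fingerprint of a tag mixture with the isotope pattern of the molecule it is attached to is a discrete convolution on a unit-mass grid. The sketch below is an assumption on our part, not the calculation documented in the ESI: it ignores the small change in isotope pattern caused by the loss of water in the amide coupling, and both input patterns are made-up toy values. The function name `convolve_patterns` is hypothetical.

```python
def convolve_patterns(tag, partner):
    """Discrete convolution of two isotope patterns on a unit-mass grid.

    Index i of each list is the mass offset (in nominal mass units) from that
    species' lightest peak, so index i + j of the result is the offset of the
    combined species.
    """
    combined = [0.0] * (len(tag) + len(partner) - 1)
    for i, p in enumerate(tag):
        for j, q in enumerate(partner):
            combined[i + j] += p * q
    return combined

# Hypothetical inputs: a 1:1 tag mixture of two isotopologues two mass units
# apart, and a partner molecule with a small M+1 peak.
tag_mixture = [0.5, 0.0, 0.5]
amine = [0.9, 0.1]

product = convolve_patterns(tag_mixture, amine)
print([round(x, 3) for x in product])
```

The result keeps the 1:1 ratio of the two tag isotopologues but spreads each over the partner's own isotope envelope, which is the interference effect that larger tagged molecules make more pronounced.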
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4sc03519d
This journal is © The Royal Society of Chemistry 2024