Adding open spectral data to MassBank and PubChem using open source tools to support non-targeted exposomics of mixtures

The term “exposome” is defined as a comprehensive study of life-course environmental exposures and the associated biological responses. Humans are exposed to many different chemicals, which can pose a major threat to the well-being of humanity. Targeted or non-targeted mass spectrometry techniques are widely used to identify and characterize various environmental stressors when linking exposures to human health. However, identification remains challenging due to the huge chemical space applicable to exposomics, combined with the lack of sufficient relevant entries in spectral libraries. Addressing these challenges requires cheminformatics tools and database resources to share curated open spectral data on chemicals to improve the identification of chemicals in exposomics studies. This article describes efforts to contribute spectra relevant for exposomics to the open mass spectral library MassBank (https://www.massbank.eu) using various open source software efforts, including the R packages RMassBank and Shinyscreen. The experimental spectra were obtained from ten mixtures containing toxicologically relevant chemicals from the US Environmental Protection Agency (EPA) Non-Targeted Analysis Collaborative Trial (ENTACT). Following processing and curation, 5582 spectra from 783 of the 1268 ENTACT compounds were added to MassBank, and through this to other open spectral libraries (e.g., MoNA, GNPS) for community benefit. Additionally, an automated deposition and annotation workflow was developed with PubChem to enable the display of all MassBank mass spectra in PubChem, which is rerun with each MassBank release. The new spectral records have already been used in several studies to increase the confidence in identification in non-target small molecule identification workflows applied to environmental and exposomics research.


Introduction
The environment plays an essential role in the well-being of humanity. However, humans and all living beings are subjected to different chemical exposures, which can result in direct and indirect consequences for human health and the environment. This raises concerns about the impacts of exposures over our lifetime, making it imperative to study chemical exposure to recognize the effects on humans and ecosystems.1 Furthermore, the volume and number of chemicals in use are increasing drastically.2 Not all chemicals have immediate or obvious health effects. Some emerging contaminants (ECs) may affect humans and the environment even at low doses, while many effects go unnoticed until they reach a critical level and cause a given health issue. The effects may not be seen for many years or decades; one such example is Parkinson's disease.3 ECs comprise a diverse group of chemicals including pesticides, pharmaceuticals, endocrine disruptors, personal care products, surfactants, flame retardants, plasticisers, industrial agents, artificial sweeteners, and gasoline additives.4 Chemicals can interact in numerous possible ways, making it difficult to predict the effects of different chemical mixtures.5 So far, epidemiological studies have generally focused on single factors but sidelined the combined effect of different chemical exposures.6 Due to poor characterization of exposures and growing awareness of the need to study several exposures simultaneously, interest in the concept of the exposome is increasing.7 The term exposome was first proposed by Christopher Wild in 2005, when he defined it as the "life-course environmental exposures (including lifestyle factors), from the prenatal period onwards".8,9 Studies on the exposome can encompass small molecules (e.g., lipids, drugs, xenobiotic compounds, metabolites) as well as non-chemical stressors such as radiation, industrial processes, consumer goods, pathogens and diet.
In exposome analysis, mass spectrometry (MS) is a widely used analytical technique, mainly due to its suitability for sensitive qualitative and quantitative analysis of complex samples. MS, when coupled with gas or liquid chromatography for separation before mass spectrometric detection,10 covers a wide range of applications by separating out sample complexity and providing additional structural information about the compounds as they elute from a chromatographic column. Methods based on gas chromatography coupled with MS (i.e., GC-MS) have been developed to analyze different classes of chemical substances, including highly hydrophobic and volatile substances. However, more polar pesticides, water-soluble endocrine disruptors and industrial pollutants are more amenable to liquid chromatography (LC)-MS methods.11 LC-MS, especially high resolution MS (HR-MS), is widely used for screening emerging pollutants, and the detection of different chemicals can rely on many different tools.12 The high throughput omics technologies (metabolomics/exposomics) have helped in integrating a wide range of exposures.7 Advancements in analytical instruments, high throughput statistical analysis, HR-MS, data processing algorithms, chemical compound databases and mass spectral libraries contribute to exposome research.13 Ideally, this would lead to the identification of chemical stressors and robust biomarkers that can be used to deduce the adverse effects of exposures.14
Mass spectrometry analysis can be broadly categorized into two different approaches: (i) targeted analysis and (ii) non-targeted analysis (NTA). In targeted analysis, the compounds (so-called "targets") are known in advance and can be identified in the sample through optimized laboratory methods developed with the use of reference standards. This approach requires prior knowledge of the compounds and the development of suitable methods.15 However, ECs are often overlooked in samples when only targeted analysis is performed, since it is impossible to perform target analysis on the multitude of ECs potentially present in the environment. This can be addressed using a non-targeted approach to identify "unknowns" in the sample.16 Compound-specific data is collected via instrumental analysis ('detection') and the detected mass spectrometric feature is then linked with a tentative chemical identity based on the evidence, in a process called 'annotation'. The process of verifying that the annotated compound is indeed the proposed chemical is called 'identification'.17 Despite all these steps in identifying the compounds, the peaks cannot always be interpreted with the same confidence level.18 To communicate this more accurately, a five-level scheme ranging from level 1 (confirmed) to level 5 (exact mass only) is often used to report the identification confidence.19
Confident identification of unknown chemical substances in MS studies requires consistent workflows and corresponding computational tools and data. To increase the identification confidence, it is vital to use experimental evidence combined with chemical metadata, library searching and chemistry databases in the workflow. A mass spectral library will facilitate level 2a identification confidence by providing a sufficient match of the probable structure with the library spectrum. Level 1 confidence can be achieved when the retention time (RT) of data acquired in-house on the same analytical setup also matches that of the library spectrum.19 Supported by such information, NTA methods can use advanced analytical instruments, spectral libraries and computational tools to help identify "known unknowns" or previously understudied compounds. Thus, NTA methods can be used to screen a broad spectrum of chemicals occurring in environmental samples, which is essential to explore the exposome. This approach does not need any prior knowledge about the chemicals to be screened and allows rapid characterization of chemicals.15
A typical NTA workflow includes several steps: sample collection, sample preparation, analysis and data processing. However, each of these steps can be tedious, especially the data treatment step. The data processing steps mainly involve peak picking, peak alignment, peak integration and finally identifying the structure of the compound behind the peaks.18 Thus, creating an in-house spectral library will improve future NTA workflows internally, potentially to level 1. Additionally, uploading the spectra online will facilitate level 2a identifications at other institutes. Mass spectral libraries are capable of providing rapid tentative identifications with a relatively high level of confidence in this regard.20 MassBank was established in 2006 and published in 2010 as one of the first open mass spectral libraries, hosted in Japan.21 MassBank Europe was established as a mirror server in 2012.22 Over the years the code and data migrated to GitHub.23
Since mass spectral libraries are inherently incomplete and the fragmentation spectrum of a given chemical typically only conveys a fraction of the chemical information, it is challenging to ascertain the identity of a chemical in NTA using fragmentation information alone. Additionally, compound (i.e., chemical) databases can be used to more accurately identify and annotate chemicals in a sample. It is possible to search compound databases for candidates by using the exact mass or calculated molecular formula, combined with in silico fragmentation techniques to sort the candidates.20 The National Institutes of Health (NIH) maintains PubChem, an open chemistry database containing more than 115 million compounds (https://pubchem.ncbi.nlm.nih.gov/).24 PubChem is one of the most comprehensive open online chemical databases. Data in PubChem is available for download in multiple formats (including CSV, XML, JSON, and SDF) and is complemented by a range of web services and APIs. PubChem offers many features for researchers and scientists in a variety of disciplines. As the community moves beyond pure chemical identification towards more detailed interpretation of HR-MS datasets, improving the content in publicly available databases is imperative.
In August 2015, EPA's Non-Targeted Analysis Collaborative Trial (ENTACT) was initiated to evaluate the efficiency of NTA methods for identifying unknown chemicals in samples. The purpose of this trial was to generate, interpret and exchange different NTA results and set up a database of spectral records that can be used for future NTA evaluation in identifying unknown compounds. ENTACT offered a way to obtain large numbers of chemicals of various classes for method development, as well as a means to test various workflows. This study aimed to generate MassBank records for the ENTACT mixtures, to expand the internal LCSB spectral library as well as the public MassBank.EU library, and to support future non-target studies on complex mixtures of chemicals. This article describes the methods and results of the data processing, including the uploading of the spectra to MassBank, plus the subsequent integration within PubChem using open workflows.

ENTACT mixture composition
The ENTACT dataset contains 1268 substances (in total) sourced from individual ToxCast chemical substances, which were distributed into 10 mixes.1 The ten synthetic mixtures (mixes 499, 500, 501, 502, 503, 504, 505, 506, 507, and 508) contain 95-365 substances each, with varied levels of complexity, as indicated in Table 1. Mixes 1-4 and 9 each contained a total of 95 unique substances, mixes 5 and 6 contained 185 substances each, of which mix 5 included the replicate set of mix 1, while mixes 7, 8 and 10 each contained 365 substances. Mix 7 comprised 270 substances plus the replicate set from mix 1. Mix 9 was especially designed to be the most challenging mix due to the presence of many isobaric substances (substances with identical mass, defined further below), while mix 10 contained substances of low purity and low concentration,1 posing a complementary analytical challenge to the other, higher quality mixes.
The ToxCast list provided with the ENTACT mixes contained the following information, which was used for curating the input compound list: identifiers, namely the DSSTox substance (DTXSID) and compound (DTXCID) identifiers from CompTox,25 and the Chemical Abstract Service (CAS) registry number;26 structure in the Simplified Molecular Input Line Entry System (SMILES) format;27 InChIKey (the hash of the International Chemical Identifier, InChI28); preferred name, molecular formula and exact mass; and MS-ready forms29 of SMILES, InChIKey, molecular formula and exact mass.
An internal identier (termed "Identier") was also assigned to map candidates to the correct MS/MS record in RMassBank.The "MS-ready" form refers to a structure that has had counterions and charges removed to represent a neutral form that will be analyzed in the instrument (pre-ionization). 29

Chemical analysis
Following sample receipt, an initial test analysis was conducted on the ENTACT mixtures on 12 August 2019 and the mixtures were remeasured on 3 March 2020. Reverse Phase Liquid Chromatography (RPLC) was performed using an Acquity UPLC BEH C18 column (dimensions: 1.7 µm, 2.1 × 150 mm) from Waters with a guard column. The flow was set at 0.20 mL min−1 using water with 0.1% formic acid (A) and methanol (B) as mobile phase. The gradient started at 90% of A and 10% of B at 0 min until 2 min before reaching 100% at 15 min. Mass spectrometric detection was performed with a Q Exactive Orbitrap HF (Thermo Scientific) in positive and negative ionization mode, using electrospray ionisation (ESI). The product ion (MS/MS) spectra were acquired in data-dependent mode using a mass list of all the mixtures as an inclusion list. The samples were fragmented at six different nominal collision energy (NCE) levels (15, 30, 45, 60, 75, and 90 NCE) in separate runs, generating data files for individual collision energies per mixture and mode.

Data analysis
The workow for the creation of MassBank records from the ENTACT mixtures is shown in Fig. 1.The samples and chemical analysis (shown in orange) are described above.The data extraction (blue middle section) was performed using the R packages Shinyscreen 30 and RMassBank 31 via the wrapper RMBmix, 32 as explained in the next sections.The generated records were then validated using ReSOLUTION and the MassBank validation checks prior to upload (green section, Fig. 1).All packages and code are available on the ECI GitLab pages (see "Data availability" and "References").

Data extraction, prescreening and quality control with Shinyscreen
Shinyscreen is an interactive web interface developed using the shiny package in R, building on the MSnbase33 and mzR packages to explore, process and interpret MS data. It is specifically designed to extract the spectral data, perform an automated quality check (based on peak alignment, retention time shift, intensity and signal to noise ratio) and visualize the MS1 chromatograms plus MS/MS scans and spectra. The automated quality assessment can be updated manually after visual inspection. Shinyscreen, currently at v1.3.21, is a semi-automated computational tool that reduces the effort of manual prescreening.34 This work was performed using an older version of Shinyscreen (v0.51).
The rst step in the data analysis was to convert raw data les from the LC-MS analysis to the mzML le format 35 using MSConvert from Proteowizard (v.3.0.19182-51f676e) 36 in peak picking mode to convert the spectral information from prole to centroid mode.The chemical analysis resulted in mzML les per mix, mode (positive and negative) and per collision energy.These were congured as input data les in Shinyscreen, along with the 'compound list' for the appropriate mix (dening the chemicals to screen) and the 'setID list' (used to map the les, compounds and mixes).The compound list contained the unique identiers (IDs) assigned to the ENTACT compounds, along with the name and MS-ready SMILES, saved as Comma Separated Value (CSV) le.SetID is a CSV le that includes the ID and a set (mix name), which is used to identify the compounds in each mix.The input lists are provided in the GNPS repository 37  View Article Online time tolerance was set to +/−0.5 minutes before the extraction was initiated.The MS/MS scan was deemed to be RT shied (too far from the MS1 apex) when it exceeded the RT tolerance limit and was thus excluded from further consideration.Once the threshold values were established, the data was preprocessed to perform automatic quality control analysis.Preprocessing generates a le table in CSV format, which is updated and saved at the end of the process.The compounds that failed to meet the quality criteria were excluded from further processing with the RMasSBank workow.The automatic quality control procedure is presented in greater detail with plots demonstrating how the procedure would proceed in a "pass" and "fail" scenario elsewhere. 38Prescreening the mass spectrometry data is crucial to remove spectra resulting from compounds that do not satisfy the quality criteria, especially in the case of complex mixtures as measured here.

Record generation with RMassBank and RMB-mix
Spectra that satised the Shinyscreen quality criteria were further processed with the RMassBank workow for record generation, via the RMB-mix package. 32RMB-mix is a Shinyscreen-assisted RMassBank workow, which uses the prescreening summary output to exclude compounds of bad quality prior to generating MS/MS spectral records that users can then upload to MassBank.This package adapted the RMassBank workow for a high-throughput record generation workow of mixtures, since the original RMassBank was designed to work on single standards.In the RMassBank package, functions include extracting tandem MS spectra, assigning formulas to fragments, calibrating tandem MS spectra with fragments, cleaning spectral data, retrieving compound information from online databases, and exporting compound information to MassBank. 31In addition to the mzML les and compound list, this workow requires an additional input le known as a settings le.The settings le contains information like ionization mode, authors, instrumentation type, solvent gradient, chromatographic specication, etc.This is saved as 'settings.ini'and denes the information included in the generated MassBank records.An example settings le is provided in the GNPS repository associated with this article 37 (see also "Data availability").
The RMB-mix method package runs on a slightly modified version of the RMassBank (ver. 2.99.2) code to handle complex mixtures like the ENTACT data. Record generation involves a two-step computational workflow: (i) the MS/MS workflow, and (ii) the generation of MassBank records. In the first step, MS/MS data is extracted from the raw files and then recalibrated, denoised, and annotated with sub-formulas using the RMassBank procedures.31 The second step entails the annotation of peak lists using various web services and the settings file provided by the user. The SMILES in the input CSV is used to query information from the databases, via the InChIKey. The SMILES are converted into the InChIKey using CACTUS.31 The InChIKey is then used to retrieve synonyms, IUPAC names, and other identifiers from the Chemical Translation Service and PubChem.31 This is streamlined with the chemical compound information provided in the input compound lists and included in an output file and the corresponding MassBank records.

Record validation with ReSOLUTION and MassBank validator
Once MassBank record les were generated, a quality check was carried out using the ReSOLUTION package.ReSOLUTION is an R package that contains functions to summarize the records, giving an overview of different elds within the text les.The getMBRecordInfo function of the ReSOLUTION package produces a CSV le that contains entries from all MassBank records within the specied directory that match the specied list of MassBank record elds.The summary le was manually veried to remove any spectra with a base peak intensity below 10 5 and check for any other artefacts.
Following that, an additional validation step was performed to check if the records met the MassBank data criteria, as a part of pre-submission checks. The directory was set up for validating the records by downloading data from the MassBank-data repository on GitHub and the MassBank validation was performed using a bash script. Once all checks were passed, the spectra were submitted to MassBank via a GitHub pull request.

MassBank-PubChem integration
The ReSOLUTION-based record validation was modified to extract data from the entire MassBank-data repository and produce summary files to enable the integration of MassBank.EU record previews in individual compound records in PubChem. After several rounds of display optimization with PubChem, the following fields were agreed upon (where available): Accession ID, Authors, Instrument, Instrument Type, MS Level, Ionization Mode, Ionization, Collision Energy, Fragmentation Mode, Column Name, Retention Time, Precursor m/z, Precursor Adduct, License, Publication, SPLASH (the SPectraL hASH39) and the Top-5 Peaks for display purposes, in addition to the Name, SMILES, InChI, InChIKey fields for structural information (to map to the correct PubChem record). Compound class information is also extracted but not currently used. The Top-5 peaks were chosen as the optimal display experience, as the Top-3 were determined to be too few, while the Top-10 created many overlaps in the spectral images and resulted in a lengthy data display. Trimming was also necessary to avoid issues with spectra that contained hundreds of peaks.
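The trimming to the Top-5 peaks can be illustrated with a short Python sketch. It assumes that "top" means the most intense peaks and that the peaks are listed in order of decreasing intensity; neither the ordering nor the function name is taken from the deposition code itself.

```python
def top_n_peaks(peaks, n: int = 5):
    """Return the n most intense peaks from a list of (m/z, intensity) pairs.

    Peaks are returned in order of decreasing intensity (an assumption made
    for this sketch); spectra with fewer than n peaks are returned in full.
    """
    return sorted(peaks, key=lambda peak: peak[1], reverse=True)[:n]
```

Because the slice is applied after sorting, a spectrum with hundreds of peaks is reduced to a fixed-size preview, while sparse spectra pass through unchanged.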
The entire MassBank summary le retrieved via ReSOLU-TION forms the basis to create a substance le for deposition within PubChem.This takes the compound elds (SMILES, InChI and InChIKey), performs several clean-up steps (e.g., removing "N/A" entries, SMILES with wildcards, and duplicate entries), and creates a le for deposition.During this process several issues were found with both MassBank.EU entries and the PubChem deposition checks.All errors in MassBank that could be addressed were corrected aer discussion with contributors (see MassBank-data issues), while several updates were also made to the PubChem deposition process.Some MassBank entries had invalid SMILES that would not pass the improved deposition system, but could not be xed due to conditions on the contributed MassBank record.These SMILES were added to a "bad list" and are removed during creation of the deposition le.The code for all these steps is available on the ECI GitLab repository 40 (see "Data availability").
Once the MS/MS and substance les were created, they were uploaded to Zenodo. 41An FTP service was set up to transfer the substance le automatically to PubChem when a new le is available.This data deposition can also be uploaded manually.The annotation is coordinated via a mapping le to the corresponding le on Zenodo, hosted on the ECI GitLab 40 (see "Data availability").Once a new MS/MS le is available it is parsed and updated at the next weekly cycle.
The spectra are rendered in PubChem on the fly with a simple CGI interface, which was created as part of these efforts and later expanded to display other content in PubChem.24 The Top-5 peak display is designed to enable simple copy-paste into downstream applications. Major fields such as Accession ID and SPLASH are hyperlinked automatically to direct PubChem users back to the original MassBank data to retrieve the entire record.

Prescreening of ENTACT mixtures
Fig. 2 shows the number of compounds that passed the quality control check at various collision energies for each mix in positive ESI mode.Fig. S3 (ESI †) shows the same for negative mode; the results are also summarized in Table S1.† The detection rate of each mode ranged between 45% and 55% per collision energy.As commonly observed in ESI, the number of compounds detected in positive mode was higher than in negative mode.

MS/MS spectra record generation
The RMB-mix method for generating MS/MS spectral records was carried out on all the mixes. After recalibration and annotation, spectral records were generated for the compounds that passed the RMassBank workflow.
Fig. 3A shows the total number of compounds for which records were generated following the RMassBank workflow per mode (783 of the 1268 compounds in total), while Fig. 3B shows an overview of the corresponding number of spectral records that were generated (up to six records per compound and mode, due to the different collision energies). The records generated in 2020 (Records_Negative_2020 and Records_Positive_2020) included isobars in the compound list. User feedback indicated that this original data set contained ambiguous spectral records. As a result, the QA/QC procedures were tightened and isobaric substances were excluded. The workflow was rerun in 2021 to remove any ambiguities and to generate clean spectral records (Records_Positive_2021 and Records_Negative_2021). The procedure to remove ambiguities in the isobaric entries is explained in the next section with some examples.

Characterisation of isobars
The results generated from the prescreening analysis were narrowed down to track compounds with the same exact mass (isomers) or with a similar exact mass within the instrumental parameters (termed isobars for the purpose of this article). Isomers are compounds with exactly the same mass but different structures. Isobars do not necessarily have exactly the same mass, but may be sufficiently close in exact mass that they fall within the precursor isolation window of 1 Da; mixed spectra could thus be created if such isobars are present in the same mix with the same retention time. Since the ENTACT standards were mixtures, and the retention times (RTs) of the individual substances on the chromatographic system were not known, the presence of one or more isomers or isobars (see e.g. Fig. 4) in a given mix could give rise to: (a) distinct spectra for each isomer/isobar with separate RTs, without knowing which peak belongs to which isomer/isobar; (b) fewer spectra than isomers/isobars, without knowing which compounds could be detected; (c) no spectra (trivial case: none of the compounds were detected).
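The flagging of isomers and isobars within one mix can be sketched in Python as follows. This is a hedged illustration, not the procedure used in this study: the function name and compound identifiers are hypothetical, and interpreting the 1 Da isolation window as a ±0.5 Da tolerance around the precursor is an assumption.

```python
def flag_isobars(masses, half_window: float = 0.5):
    """Return the IDs of compounds with at least one isomer/isobar in the mix.

    `masses` maps a compound ID to its exact (precursor) mass for one mix.
    After sorting by mass, only neighbouring compounds need comparing: any
    compound with a partner within the tolerance also has an adjacent one.
    """
    ordered = sorted(masses.items(), key=lambda item: item[1])
    flagged = set()
    for (id_a, mass_a), (id_b, mass_b) in zip(ordered, ordered[1:]):
        if mass_b - mass_a <= half_window:
            flagged.update((id_a, id_b))
    return flagged
```

Compounds returned by such a check are candidates for mixed or ambiguous spectra; whether they actually interfere also depends on their retention times, which were not known for the mixtures measured here.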
In terms of reporting the results to the ENTACT organisers, all cases with more than one compound possible for more than one peak were reported at level 3 confidence (e.g., Fig. 4, top case), whereas clear identifications (e.g., Fig. 4, bottom case) were reported at level 1 confidence.19 Fig. 5 summarises the identification levels of the compounds per mix and mode. Yellow indicates level 1 identifications (no isobars present), grey indicates level 3 (one or more isobars present), while black indicates the number of compounds not detected in that ionisation mode.
As observed above, there were generally fewer detections in negative mode, while the more complicated mixes 507 and 508 had proportionally more isobars than the simpler mixes. Mix 506 also had many level 3 identifications, especially in negative mode.

Summary of the contribution to MassBank
The results of this study are summarized as follows: after prescreening using Shinyscreen, 750 compounds in positive mode and 587 compounds in negative mode met all the quality criteria, enabling the RMassBank workflow to be executed on these compounds prior to recalibration with the RMB-mix method. The first set of 7299 spectral records (both positive and negative modes) was published on August 21, 2020. Upon characterization of the isobars in the mixes, ambiguous spectra were excluded from the database, leaving only clean spectra from 590 compounds in positive mode and 379 compounds in negative mode (including the overlap, see Fig. 3), which generated 3411 and 2171 spectral records respectively. The results of these analyses were updated in the MassBank database on 28 January 2021. This rigorous process helped maintain the quality and accuracy of the spectral records in the MassBank database. The feedback from users also helped refine the quality control procedures applied to these complex ENTACT mixtures. The addition of 5582 spectral records was a valuable contribution to the public databases, given that the MassBank database contained only 519 ENTACT-specific compounds (747 after), and the PubChem database contained only 479 of these ENTACT compounds in the LC-MS category before these efforts (780 after).
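The reported totals can be cross-checked with a few lines of Python. The record and compound counts are taken from the text; the overlap between modes is derived by inclusion-exclusion, under the assumption that the 783 compounds reported in total are the union of the final positive and negative mode sets.

```python
# Counts reported in the text for the final (2021) record set
positive_records, negative_records = 3411, 2171
positive_compounds, negative_compounds = 590, 379
total_compounds = 783            # compounds with records in either mode
max_records_per_compound = 6     # one record per collision energy

# Total number of records added to MassBank
total_records = positive_records + negative_records  # 5582

# Compounds with records in both modes (assumes 783 is the union of modes)
both_modes = positive_compounds + negative_compounds - total_compounds  # 186

# Record counts cannot exceed six per compound and mode
positive_ok = positive_records <= positive_compounds * max_records_per_compound
negative_ok = negative_records <= negative_compounds * max_records_per_compound
```

Both upper-bound checks hold (3411 ≤ 3540 and 2171 ≤ 2274), which is consistent with most, but not all, compounds yielding usable spectra at every collision energy.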

PubChem-MassBank integration
The inclusion of MassBank spectra in PubChem enabled the creation of spectral summary pages for each compound present in MassBank (see Fig. 6) and expanded both the collection of spectra in MassBank and the LC-MS category of data in PubChem (as discussed above). Since PubChem has millions of users a month, this helps provide an alternative way for researchers to find the spectral data in MassBank. It also enables the creation of powerful queries to explore available information about these compounds, as shown for mix 504 in Fig. 7.

Discussion
Data processing in R
The methodology and results specified in this study provide an insight into the systematic analysis of LC-MS data and elaborate on the use of computational processing using R.42 R plays a significant role in cheminformatics and mass spectrometry via manipulating and analyzing chemical structures and data within the R environment. Many cheminformatics packages are available including rcdk, MSnbase, RMassBank and RChemMass, with several more packages available through CRAN,43 Bioconductor and GitHub. Integrating the above packages into the R environment facilitates the development of new strategies in data handling, manipulation and implementation of a graphical user interface to serve as a platform for analyzing LC-MS data, preprocessing and visualization.31 Shinyscreen was developed in-house with flexibility and functionality in mind, removing the need for programming experience to access the tools and helping to overcome several of the inconveniences associated with package dependencies in R that can be exasperating for novice users. The key functional aspects of Shinyscreen most applicable to the work presented in this study include efficient data extraction, i.e., exporting the data files to the interface and saving the configuration settings, providing initial quality control settings, visualizing the spectrum as a PDF file and saving the summary of results as a CSV file, all of which can be exported for further analysis. The effort of manual prescreening using vendor software such as Xcalibur from Thermo can be minimized with the semi-automated Shinyscreen application. This is especially useful for complex datasets like ENTACT, because the preprocessing of data and visualizing the alignment of precursor and fragment ion retention times are simpler. The graphical user interface (GUI) for visualizing the MS1, MS/MS and peak alignment gives firsthand information on the compounds detected in the LC-MS instrument. Automatic quality control based on preset threshold values accurately filtered out compounds failing the basic criteria, and there was no need for a subsequent manual check of eliminated results (only those remaining). Shinyscreen-assisted quality control was thus crucial for assessing the quality of the dataset, filtering out compounds failing the quality criteria before the RMassBank workflow. User feedback helped refine this process further, stimulating additional Shinyscreen developments. The RMassBank workflow is independent of the Shinyscreen application. RMassBank is an R package that can be accessed through Bioconductor for the automatic recalibration and cleanup of spectral data affected by noise or mass deviations. RMassBank can handle any number of files in one run relatively quickly, generating records and annotated compound information. Despite the semi-automated workflow successfully generating clean MassBank records, manual annotation [...] and DL-ascorbic acid when performing the database retrieval (Fig. 8B).
The correct match, with the same stereochemistry but without sodium and charges, is erythorbic acid, with SMILES OC[C@@H](O)[C@H]1OC(=O)C(O)=C1O, shown in Fig. 8C. The fact that the MS-ready algorithm also removed stereochemistry hindered the recovery of information in many cases, since SMILES without stereochemistry were not always registered in databases (as they were not the complete/common representations). This made the retrieval of compound identifiers within the RMassBank workflow a challenge. To resolve this, an algorithm developed in-house helped retain the stereochemistry information from the original mixture SMILES provided with ENTACT. This issue was also reported to the US EPA for consideration in future MS-ready SMILES algorithm developments. A new infolist was generated to resolve the naming issue for these compounds. This was complemented by manual checking of the names to ensure that the correct information was present. In future efforts, this workflow could likely be optimized further by adopting options such as querying the PubChem database using InChIKeys and taking the PubChem title, IUPAC name, or top 3 synonyms, which could improve the handling of synonym issues in RMassBank.
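The PubChem lookup suggested above can be sketched as follows. This is a minimal, offline illustration in Python rather than part of the RMassBank workflow: it only builds PUG REST request URLs for an InChIKey and picks names out of already-downloaded JSON responses. The function names, the placeholder InChIKey and the case-insensitive de-duplication are assumptions made for this example.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_name_urls(inchikey: str) -> dict:
    """Build PUG REST URLs for the title/IUPAC name and the synonyms of a compound."""
    base = f"{PUG_REST}/compound/inchikey/{quote(inchikey)}"
    return {
        "properties": f"{base}/property/Title,IUPACName/JSON",
        "synonyms": f"{base}/synonyms/JSON",
    }


def pick_names(properties: dict, synonyms: dict, n_synonyms: int = 3) -> list:
    """Return title, IUPAC name and the first n synonyms of the first hit, without duplicates."""
    names = []
    # Property responses are nested under PropertyTable -> Properties.
    for entry in properties.get("PropertyTable", {}).get("Properties", [])[:1]:
        names += [entry.get("Title"), entry.get("IUPACName")]
    # Synonym responses are nested under InformationList -> Information.
    for entry in synonyms.get("InformationList", {}).get("Information", [])[:1]:
        names += entry.get("Synonym", [])[:n_synonyms]
    seen, unique = set(), []
    for name in names:
        if name and name.lower() not in seen:  # hypothetical choice: ignore case when de-duplicating
            seen.add(name.lower())
            unique.append(name)
    return unique


# Placeholder InChIKey, not a real compound.
urls = pubchem_name_urls("AAAAAAAAAAAAAA-BBBBBBBBBB-N")
```

Keeping the network request out of the sketch means the name-selection logic can be checked against stored responses before it is wired into record generation.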
Working with the ENTACT dataset was challenging due to the presence of isobars and isomers, which complicated the identification of chemicals. Several isobars with unresolved peak annotations were excluded from the results to ensure clean spectra in MassBank: any isobars present were removed if the identity of the chemicals could not be determined (see e.g. Fig. 4). Ideally, to achieve an accurate and clear analysis of the data, individual standards should be used to distinguish each peak and effectively identify any isobars present in the sample, but in this case only the mixtures were available, not the individual ToxCast standards. These, however, have been made available to other ENTACT participants. Should these data also become publicly available, they would help further enhance the quality control procedures performed here and may also assist in identifying some of the spectra that have been removed due to uncertain identities at this stage. A good match between individual standard spectra and these spectra from mixtures could potentially assist in identifying isomers/isobars and help deduce their retention times on the chromatographic system applied here.
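To illustrate why isobars within a mixture need to be flagged, the following Python sketch groups compounds whose m/z values fall within a ppm tolerance of their nearest neighbour. It is an illustration only, not the procedure used in this study: the compound names, the 5 ppm tolerance and the function names are assumptions, and only the m/z value 161.1070 is taken from Fig. 4.

```python
def ppm_difference(mz_a: float, mz_b: float) -> float:
    """Absolute difference between two m/z values, in ppm relative to mz_b."""
    return abs(mz_a - mz_b) / mz_b * 1e6


def find_isobar_groups(compounds: dict, tol_ppm: float = 5.0) -> list:
    """Group compound names (name -> m/z) whose neighbouring m/z values lie within tol_ppm."""
    ordered = sorted(compounds.items(), key=lambda item: item[1])
    groups, current = [], []
    for name, mz in ordered:
        # Close the running group once the gap to the previous m/z exceeds the tolerance.
        if current and ppm_difference(mz, current[-1][1]) > tol_ppm:
            if len(current) > 1:
                groups.append([n for n, _ in current])
            current = []
        current.append((name, mz))
    if len(current) > 1:
        groups.append([n for n, _ in current])
    return groups


# Hypothetical mixture: A and B are about 1.2 ppm apart, C and D are isolated.
mix = {"A": 161.1070, "B": 161.1072, "C": 200.0000, "D": 250.1234}
isobars = find_isobar_groups(mix)
```

Compounds returned in a group cannot be told apart by precursor mass alone, so their peaks would need retention time or reference standard information before a spectrum could be assigned.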

Practical application of the ENTACT data
The 10 ENTACT mixtures contain 1268 compounds, many of which are toxicologically relevant substances. The development of this in-house spectral library, which has also been uploaded to MassBank, was designed to aid in the identification of unknown compounds by improving the identification confidence via library match (level 2a) and even confirmation (level 1) with a retention time match if the same method was used. For external users of MassBank, this ENTACT dataset can provide level 2a, should acceptable match values be present. Since MassBank data is also integrated into other resources such as MassBank of North America 44 and GNPS, 45 as well as PubChem (as mentioned above), these spectra are now accessible to users worldwide. MassBank data is also integrated into MetFrag, 46 so that it is also accessible to non-target workflows and other in silico methods.
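The distinction between level 2a and level 1 described above can be written as a small decision rule. The Python sketch below is a simplification under stated assumptions: the function name, the minimum match score of 0.7 and the retention time tolerance of 0.2 min are hypothetical values chosen for illustration, not thresholds from this study.

```python
def confidence_level(match_score, rt_observed=None, rt_reference=None,
                     same_method=False, min_score=0.7, rt_tol_min=0.2):
    """Return "1", "2a" or None for a library match (illustrative thresholds only)."""
    if match_score < min_score:
        return None  # no acceptable library match
    rt_available = rt_observed is not None and rt_reference is not None
    # Level 1 requires a retention time match acquired with the same method.
    if same_method and rt_available and abs(rt_observed - rt_reference) <= rt_tol_min:
        return "1"
    return "2a"  # spectral library match only


in_house = confidence_level(0.9, rt_observed=5.1, rt_reference=5.0, same_method=True)
external = confidence_level(0.9)
```

An external MassBank user without the same chromatographic method would therefore end at level 2a even for a high match score.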
The spectra uploaded as part of this work have already supported other projects. This dataset served as an in-house reference library to identify pharmaceuticals 47 and pesticides 48 plus their respective transformation products in Luxembourgish water samples with identification confidence level 1, supported by other individual reference standards. It was also used to complement spectral library data in exposomics work investigating Parkinson's disease. 49 Several other studies in progress are also using this data. Hence, the 5582 spectral records generated in this study have already proven to be a valuable addition to the in-house mass spectral library, and have added to the public databases as well.
In NTA research, reference library spectra are frequently required for confident identification of unknowns. The number of possible identifications in any given study is limited by the small quantity of available reference libraries. 50 Covering the chemical space comprehensively is an enormous challenge, and to close the existing knowledge gap, it is imperative to share data across open access cheminformatics databases (PubChem, 24 CompTox 25 ) and repositories (NORMAN-SLE, Zenodo 51 ). To ensure a continuous flow of information, a workflow has been established between PubChem and MassBank Europe, enabling deposition of spectral records into PubChem and integration of accompanying (relevant) annotation content with full traceability to original data sources. 20 The code related to deposition and annotation can be found in the PubChem-MassBankEU GitLab repository 40 and the files deposited on PubChem are available on Zenodo 41 (see also "Data availability"). A key component of future efforts will be to continue to engage users and the community in developing tools and setting up databases that enable large compound knowledge bases for exposomics research, thus enabling researchers to interpret non-target HR-MS data in greater detail. 20

Conclusions
The exposome is a broad category, and one of the major challenges in the non-targeted approach to exposome studies is the high complexity of the data involved and the need to develop consolidated MS/MS libraries for annotating unknown compounds. This effort built a mass spectral library from the ENTACT dataset using different analytical and computational tools. The Shinyscreen and RMB-mix method packages ensured the generation of good quality MS/MS spectral records to set up the in-house spectral library. To date, ENTACT is one of the most extensive studies to use synthetic mixtures and multiple reference media samples to evaluate non-targeted approaches. Ideally, this will drastically improve exposome research in exposure assessment, enhance the prediction and identification of risk factors and help monitor the results of policies aimed at reducing exposures. 1 ENTACT data are toxicologically relevant and setting up a database of spectral records will help in the identification of unknown compounds. This is an important contribution for database searching, as few compounds have public library spectra and not many ENTACT chemicals were in the database prior to these efforts. Together, these expansions will advance non-targeted analysis. Although including ENTACT substances in the public record was a long-declared aim of the ENTACT organizers, the dataset contributed as part of this study actually constitutes, to the best of our knowledge, the first public MS/MS dataset resulting from this trial. MassBank is currently preparing to accept further ENTACT contributions from other sources including the US EPA, which will enhance the use of this dataset by contributing spectra measured on different instruments and under different conditions, further supporting robust identification efforts in exposomics.

making MassBank a truly open access, open data, open source resource. As such, the spectral records included in MassBank are available for use by researchers in various analytical and computational workflows. MassBank is an important resource of spectra for metabolomics, environmental and exposomics studies. MassBank records are text files containing compound metadata and mass spectral information in the rich MassBank record format.
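Because MassBank records are plain text files, they can be read with very little code. The Python sketch below is a minimal illustration, not a complete implementation of the MassBank record format: it collects "TAG: value" lines and the indented peak lines following PK$PEAK, and stops at the "//" terminator. The example record is made up for this sketch (accession, names and peak values are not real data), and the function name is an assumption.

```python
def parse_massbank_record(text: str) -> dict:
    """Collect 'TAG: value' fields and the peak table of a MassBank-style record."""
    record = {"fields": {}, "peaks": []}
    in_peaks = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_peaks and line.startswith("  "):
            mz, intensity, rel_intensity = line.split()
            record["peaks"].append((float(mz), float(intensity), float(rel_intensity)))
            continue
        in_peaks = False
        tag, _, value = line.partition(": ")
        if tag == "PK$PEAK":
            in_peaks = True  # the following indented lines hold the peaks
        elif value:
            # Tags such as CH$NAME may repeat, so values are kept in lists.
            record["fields"].setdefault(tag, []).append(value)
    return record


# Made-up record for illustration only.
EXAMPLE = """\
ACCESSION: MSBNK-EXAMPLE-XX000001
RECORD_TITLE: Example compound; LC-ESI-QFT; MS2
CH$NAME: Example compound
CH$NAME: Example synonym
CH$FORMULA: C6H8O6
PK$NUM_PEAK: 2
PK$PEAK: m/z int. rel.int.
  59.0133 1200.5 150
  115.0031 7990.0 999
//
"""

parsed = parse_massbank_record(EXAMPLE)
```

Multi-line subtags other than the peak table (for example annotation blocks) are not handled here and would need their own branch.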

Fig. 2 The number of compounds that pass the QC across multiple collision energies in positive mode. Negative data is given in Fig. S3.†

Fig. 3 Summary of the record generation results. (A) The number of compounds generated in each mode and overlap. (B) Summary of MassBank records before (2020) and after (2021) isobar characterisation.

Fig. 4 Extracted Ion Chromatogram (EIC) of m/z 161.1070 in (top) mix 508 and (bottom) mix 503. In mix 503 only one compound with this mass is present (level 1), but another isobaric compound is present in mix 508.

Fig. 6 Example of a MassBank record displayed in PubChem (arsanilic acid, CID 7389), with the Accession ID and SPLASH hyperlinked to the original MassBank record and an interactive thumbnail.

Fig. 7 (A) Exploring the entire ENTACT dataset on the PubChem table of contents (TOC) classification browser (https://pubchem.ncbi.nlm.nih.gov/classification/#hid=72). Note: some categories have been trimmed for space reasons. (B) Classification of mix 504 of the ENTACT dataset using PubChem annotation (sub)categories.

Fig. 8 (A) Substance according to the original SMILES in ToxCast, sodium erythorbate. (B) The MS-ready version of the same compound, DL-ascorbic acid, with salt and stereochemistry removed. (C) The compound according to the adjusted SMILES, restoring the correct stereochemistry: erythorbic acid.
DSSTox  Distributed-structure-searchable toxicity
DTXCID  DSSTox structure identifier
DTXSID  DSSTox substance identifier
EC  Emerging contaminants
ECI  Environmental cheminformatics group (at LCSB)
EIC  Extracted ion chromatogram
ENTACT  EPA non-targeted analysis collaborative trial
ESI  Electrospray ionization
FTP  File transfer protocol
GC-MS  Gas chromatography mass spectrometry
GNPS  Global natural products social molecular networking
GUI  Graphical user interface
HILIC  Hydrophilic interaction chromatography
HRMS  High resolution mass spectrometry
IUPAC  International union of pure and applied chemistry
InChI  International chemical identifier
InChIKey  IUPAC international chemical identifier key
LC-MS  Liquid chromatography mass spectrometry
LCSB  Luxembourg centre for systems biomedicine
MoNA  MassBank of North America
MS  Mass spectrometry
NIH  National institutes of health
NTA  Non-targeted analysis
RPLC  Reverse phase liquid chromatography
RT  Retention time
(US) EPA  (United States) Environmental Protection Agency

Paper. Environmental Science: Processes & Impacts. Open Access Article. Published on 10 July 2023. Downloaded on 3/18/2024 4:28:52 AM. This article is licensed under a Creative Commons Attribution 3.0 Unported Licence.

Table 1 Summary of different ENTACT mixes, number of substances and complexity of the mixes