Extraction of chemical structures from literature and patent documents using open access chemistry toolkits: a case study with PFAS

The extraction of chemical information from documents is a demanding task in cheminformatics due to the variety of text and image-based representations of chemistry. The present work describes the extraction of chemical compounds with unique chemical structures from the open access CORE (COnnecting REpositories) and Google Patents full text document repositories. The importance of structure normalization is demonstrated using three open access cheminformatics toolkits: the Chemistry Development Kit (CDK), RDKit and OpenChemLib (OCL). Each toolkit was used for structure parsing, normalization and subsequent substructure searching, using SMILES as structure representations of chemical molecules and International Chemical Identifiers (InChIs) for comparison. Per- and polyfluoroalkyl substances (PFAS) were chosen as a case study to perform the substructure search, due to their high environmental relevance, their presence in both literature and patent corpuses, and the current lack of community consensus on their definition. Three different structural definitions of PFAS were chosen to highlight the implications of various definitions from a cheminformatics perspective. Since CDK, RDKit and OCL implement different criteria and methods for SMILES parsing and normalization, different numbers of parsed compounds were extracted, which were then evaluated using the three PFAS definitions. A comparison of these toolkits and definitions is provided, along with a discussion of the implications for PFAS screening and text mining efforts in cheminformatics. Finally, the extracted PFAS (~1.7 M PFAS from patents and ~27 K from CORE) were compared against various existing PFAS lists and are provided in various formats for further community research.


Introduction
Per- and polyfluoroalkyl substances (PFAS) are compounds of high public interest as there is increasing evidence that exposure to PFAS can lead to adverse human and environmental health effects. 1,2 These concerns are accompanied by their documented accumulation in the environment (as so-called "forever chemicals") due to their widespread use and stability. 3 Well-known PFAS include older PFAS such as PFOA (perfluorooctanoic acid) and PFOS (perfluorooctane sulfonic acid), as well as newer PFAS such as GenX (a replacement product for the older PFAS). There is strong regulatory debate about PFAS, including calls to regulate them as a class 4 and for better approaches to detect PFAS in humans and in the environment. Since PFAS and replacement PFAS products are a fast-moving business, cheminformatics tools are gaining importance in identifying candidate PFAS compounds from within scientific and other text sources such as patent repositories, including in-house confidential business documentation.
Past efforts to identify and collect chemical structures of existing PFAS have resulted in several so-called "suspect" lists. The Organisation for Economic Co-operation and Development (OECD) released a PFAS list containing 4729 PFAS entities in 2017 (ref. 5 and 6) (hereafter "OECDPFAS"). The United States Environmental Protection Agency (EPA) "PFASMASTER" list currently (December 2021) contains 12 048 PFAS entries, 7 merged from several PFAS lists on the EPA CompTox Chemicals Dashboard. 8 Of these two lists, PFASMASTER contains 10 785 entries that can be represented by an International Chemical Identifier (InChI), while the OECDPFAS list contains 3741 entries with an InChI, using versions downloaded from the EPA website on 2021-12-11 (ref. 7 and 9) and provided in ref. 10. The other entities in the lists are substances without a clear composition, or with known composition that cannot be represented fully with an InChI. Of the 3741 OECD compounds with an InChI, 3731 are also contained in the PFASMASTER list (by matching InChI).
These lists and more are used in environmental assessments to gauge the extent of the "PFAS knowledge gap". Such lists serve additional purposes, e.g., to search for the respective compounds in analytical data of environmental samples. 11 The majority of PFAS suspect lists are hand curated, painstakingly compiled by experts and thus limited both by access to relevant information and by the manual nature of the efforts. Since the current definition of PFAS is strongly debated by the community, three different structural definitions of PFAS in use have been considered in this case study, clarified below and shown in Fig. 1.

Definition A
Each compound that contains a CF2 group is considered a PFAS. This definition has been proposed recently by the OECD. 12,13 This definition will lead to a large number of chemicals that are considered to be PFAS.

Denition B
Each compound that contains a (AH)(AH)(F)C-C(AH)F 2 group is considered a PFAS, where the AH groups could be hydrogen or any other atom and the bond between both aliphatic carbon atoms is a single bond.This denition is used in this present work as a straightforward structural denition as a compromise between denitions A and C.

Denition C
Each compound that contains a (R 1 )(R 2 )(F)C-C(R 3 )F 2 group is considered a PFAS, where the R groups are any atom except hydrogen and the bond between both aliphatic carbon atoms is a single bond.This is a new, very recent EPA denition. 14,15This denition will lead to the least amount of PFAS molecules.
Extracting chemical information from text documents is a challenging task. Unlike other natural language terms, chemistry-related terms pose additional challenges, as the number of known chemical compounds with unique structures is not only very high (e.g. PubChem 16 currently contains 111 M unique compounds, which is only a tiny fraction of the estimated chemical space) but they may also appear in text documents in a multiplicity of forms. Examples include trivial names (e.g. perfluorooctanesulfonic acid, PFOS), International Union of Pure and Applied Chemistry (IUPAC) names (e.g. 1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8-heptadecafluorooctane-1-sulfonic acid), mixtures of trivial and IUPAC naming, enumerations of Markush 17 structures, trade names and half formulas (e.g. Krytox oils, F-(CF(CF3)-CF2-O)n-CF2CF3 where n = 10-60), database identifiers such as Chemical Abstracts Service (CAS) registry numbers (e.g. 1763-23-1), PubChem Compound Identifiers (CIDs, e.g. 74483), and even images that are referenced in the text with simple numeric labels. Advanced and flexible methods are required to capture all types of chemical information, with subsequent cheminformatic manipulation to ensure correct mapping to detailed structural information.
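As an illustration of how one of these identifier types can be screened in plain text, the following minimal sketch recognizes candidate CAS registry numbers and validates them with the CAS check digit (the weighted sum of the preceding digits, modulo 10). It is not the method used by the annotation pipeline described in this work; the regular expression and the function names `is_valid_cas` and `find_cas_numbers` are illustrative assumptions.

```python
import re

# Candidate CAS registry number: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def is_valid_cas(candidate):
    """Return True if the string has CAS format and a correct check digit."""
    match = CAS_PATTERN.fullmatch(candidate)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weights 1, 2, 3, ... are applied from the rightmost digit leftwards.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == int(match.group(3))


def find_cas_numbers(text):
    """Return all CAS-like tokens in the text that pass the checksum."""
    return [
        m.group(0)
        for m in CAS_PATTERN.finditer(text)
        if is_valid_cas(m.group(0))
    ]
```

A checksum only filters out typographical errors and look-alike tokens such as dates; a passing token still has to be resolved to a structure through a registry lookup.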
The automated analysis of the increasing number of accessible scientific documents may provide input to fuel scientific studies to identify novel molecules with potentially desired or undesired properties. OC|processor 18 is a modular semantic annotation toolkit, based on Apache UIMA. 19 It is designed to annotate different document types such as PDF, images, HTML, XML, MS Office and plain text documents. It uses a range of established dictionaries and ontologies as well as rule-based algorithms to annotate and index scientific named entities such as diseases, genes, species and chemistry. The properties of concept synonyms as well as the hierarchy of ontological concepts are taken into account to provide more accurate context-sensitive annotation. For example, the term "sting" could be annotated as a known musician, a species, a disease or a protein.
OC|processor disambiguates based on the term environment and the presence of related concepts, assigning the annotation/knowledge domain with the highest confidence value. The precision and recall of OC|processor have been detailed elsewhere. 20 For this study, the growing open access document repository CORE 21,22 (COnnecting REpositories) and the patent full text documents in Google Patents 23 were selected to demonstrate the automated capability of identifying and analyzing scientific entities, applied to the case study of potential PFAS in documents. OC|processor 18 was used to automatically identify and extract mentions of chemical compounds from patents and other open access scientific documents such as scientific articles and university documents in CORE. The resulting collection of diverse chemical compounds was subsequently filtered for small molecule compounds for which a unique InChI 24 could be generated, thus removing incompletely defined structures such as substances and polymers, as well as mentions of chemical class terms and Markush-like 17 structures. Of the three definitions presented above, definition B was used for most of the detailed investigations in this study. The final PFAS lists are available for all three definitions described above and have been made public, together with additional results, in various formats 10,25 (see also data availability) for general assessment and as input for future studies.

Semantic annotation and extraction of chemical compounds
OC|processor 18 comprises various modules that take the different modalities of chemistry into account, aiming at a comprehensive annotation of chemistry terms in documents. This allows the identification of novel concepts and compounds that were not yet known before a given document was annotated. If new compounds are identified, these are registered in Google BigQuery 26 tables in the open access SciWalker-Open-Data project, giving access to >150 million small molecules with a unique standard InChI (version 1.03). 24 These unique InChIs were generated from connection tables generated from the SMILES 27-29 representations of chemical structures. SMILES containing a wildcard entry (i.e. "*") were considered as representing a scaffold containing an undefined substituent and were not registered. Thus, the current approach is limited by the expressivity of SMILES as well as the InChI rules. For example, standard InChI will represent different tautomers of a molecule as one unique structure, while neither SMILES nor InChI consider coordinate (dative) or hydrogen bonds. Since valence isomerism is not handled by either system, this would result in different structures for molecules exhibiting valence isomerism. 30 Hereafter, the use of "unique InChI" or InChI in this manuscript refers to a unique standard InChI (version 1.03).

Document sets
CORE documents. A total of 71 963 421 de-duplicated documents were selected and downloaded from the CORE document set of open access documents. 22 These documents, when annotated with OC|processor, resulted in the annotation of 818 280 compounds with a unique InChI. 31 The SMILES extracted from CORE are from the text only; images were not extracted.
Patent documents. Google Patents contains over 120 million patent publications from 100+ patent offices worldwide, available for open access searching. 23 For the current work, a set of 111 730 728 Google Patent documents was used, semantically annotated with OC|processor in May 2021 using both the text and images found in these patents. The resulting annotations are available in a BigQuery table 32 34 in the SciWalker-Open-Data project. 35 As a next (pre-filtering) step, the 18 032 261 unique compounds from the chemistry annotations of patents were reduced to a dataset of 4 182 712 SMILES that contained an "F" character, resulting from a fluorine, iron or francium atom.
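The character-based pre-filter described above can be sketched in a few lines. This is a minimal illustration, not the code used in the study; the function names are assumptions. It shows why the test on an upper-case "F" yields a superset of the fluorine-containing structures: element symbols such as Fe and Fr also match and have to be sorted out later by the substructure search.

```python
import re

# Element symbols starting with an upper-case "F". Aromatic (lower-case)
# SMILES atoms never contain "F", so "Fc1ccccc1" is read as F followed by c.
_F_SYMBOLS = re.compile(r"Fe|Fr|Fm|Fl|F")


def contains_f_character(smiles):
    """Cheap pre-filter: keep every SMILES containing an 'F' character."""
    return "F" in smiles


def f_symbols(smiles):
    """Return the set of F-initial element symbols found in a SMILES."""
    return set(_F_SYMBOLS.findall(smiles))


def prefilter(smiles_list):
    """Reduce a SMILES collection to the candidates passing the pre-filter."""
    return [s for s in smiles_list if contains_f_character(s)]
```

The filter is deliberately permissive: it never discards a fluorinated compound, at the price of passing some metal-containing structures on to the more expensive parsing and substructure steps.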
The quality of the chemistry-related annotations from the combined text and image patent data is lower than from the CORE set. Optical structure recognition and extraction from images often leads to erroneous structures such as compounds containing hypervalent atoms or wrong isotopes that arise from poor image quality.

Compound structure normalization
Normalization (or standardization) of compound structure representations is an important step in preparing compounds for further analysis, including reliable substructure searching.
Thus, the various effects of parsing the SMILES strings from the steps above to create a molecule object, plus subsequent normalization, were investigated using three different open access chemistry toolkits: RDKit (version 2020.03.2), 36 the Chemistry Development Kit (CDK, version 2.5) 37,38 and OpenChemLib (OCL, version 2021.11.3). 39 The approaches used were:
RDKit: with the two available standardizers, molVS 40,41 and rdMol.
CDK: via SMILES parsing, normalizing the SMILES with the kekulize option.
OCL: via SMILES parsing and MoleculeStandardizer, writing the SMILES in a kekulized form.
Aer parsing the input SMILES, the resulting molecule object was again represented as SMILES as an intermediate step before parsing it again and performing the substructure search to classify it as a PFAS or non-PFAS.This procedure has an effect on the parsing results as described below; in a production environment this additional SMILES generation step would probably not be performed.

PFAS substructure search with graph-based atom-by-atom search (ABAS)
In-house Java code calling the respective CDK and OCL libraries and Python scripts based on RDKit were used for the substructure calculations. 42 To ensure that the graph-based atom-by-atom search (ABAS) subroutines were implemented correctly, the code was tested using the query and SMILES set mentioned in the RDKit manual.

PFAS substructure search with fingerprint selection and ABAS
As a first step, molecular fingerprints were calculated for the extracted molecular structures to create a search index using Apache Lucene in the following manner. Fingerprints (FP) were calculated by the respective toolkit libraries as shown in Table 1. These fingerprints were then stored for each molecule as a "document" in a Lucene index, providing the necessary fingerprint index of the molecules. The fingerprint of the substructure query was then calculated in the same way, followed by searching the Lucene index for candidates. In a second step, the resulting candidate compounds were filtered by the ABAS graph-based substructure search from above. Molecules passing both steps were considered as hits. This approach has recently been implemented in Sachem, 43 which stores fingerprint data in an experimental Lucene implementation ported to C. In this study, a standard Lucene implementation in Java 1.8 was used with the fingerprint libraries pattern fingerprinter (RDKit), DescriptorHandlerLongFFP512 (OCL) and CDKFingerprinter (CDK). The pattern fingerprint of RDKit uses SMARTS patterns to generate topological fingerprints of molecules. 44 The DescriptorHandlerLongFFP512 of OCL is a binary fingerprint that depends on a dictionary of 512 predefined structure fragments. 45 The CDKFingerprinter generates one-dimensional bit arrays, where bits are assigned based on the presence of a certain structural feature in a compound. 46 The molecules were normalized using the options available in OCL and CDK, and the molVS standardizer for RDKit.

Compound structure normalization
Several instances of different cheminformatics toolkits producing different normalized SMILES expressions were found. These inconsistencies influence later results and are described below with specific examples.
Invalid SMILES expressions. A particular SMILES may contain expressions that are not compliant with the official SMILES definitions; these should either be rejected or elicit a warning from a SMILES parser. For example, while C[N@@@H]C is not a syntactically proper SMILES, it is nevertheless accepted by the commercial toolkit ChemAxon 47 as well as CDK, which transform it to [#6;A][#7;AH1;@@@][#6;A] or C*C, respectively, which is likely something entirely different from what was originally intended. However, C[N@@@H]C is rejected by the RDKit and OCL parsers, which is likely a more reasonable behaviour.
Valence rule violations. While an extracted and parsed SMILES may be formally correct when generated by chemistry-recognizing annotation modules, such as the optical structure recognition software OSRA 48,49 for image-to-structure conversion, the resulting molecular structure may violate obvious valence bond order rules. For example, the OSRA input SMILES (see Fig. 3A) (RDKit), respectively. However, it is rejected by CDK, as it cannot assign a valid Kekulé structure to a 5-membered aromatic ring containing a triple bond representing an abnormal valence. While this behaviour may be intended (or even desired), the end result is that it changes the input SMILES to a different output SMILES, which results in a different chemical structure and thus a different InChI. In other words, it changes the meaning of the input to an assumed desired output. Ideally, such changes/corrections should be separated out into an optional module that can be switched on or off by the user of that toolkit, to enable better control over such behaviour depending on the use case.
The number of molecules rejected when parsing the SMILES differs considerably between the toolkits. A rejected SMILES cannot be used for subsequent substructure search, potentially reducing the number of identified PFAS molecules. Thus, the quality of the different SMILES parsers was checked by first parsing the input SMILES, then generating the corresponding InChI from the molecule object. In a second step, a normalized SMILES was written from the molecule object, parsed again and the InChI of this "reparsed" SMILES was calculated. Discrepancies between the InChIs from step one and step two in this procedure reveal issues in the quality of the parsing.
Normalization. For the purposes of further comparison, normalization or standardization of the SMILES input is needed, as the same molecule can be represented by different SMILES. While the terms "normalization" and "standardiza-  O)). Since each chemistry toolkit uses somewhat different rules to normalize SMILES, this has an effect on the outcomes of the PFAS substructure search described below. Some normalization tasks may also be performed by specific "standardizer" modules of the toolkits that use rules (with varying degrees of available documentation) to transform SMILES into a normalized form.

PFAS substructure search (definition B) and effect of prior normalization
The effect of normalization on the PFAS substructure search using definition B (Fig. 1B) on the CORE dataset is given in Table 1. The maximum number of unique PFAS compounds found by CDK and OCL using normalization is the same, i.e. 4192 PFAS (according to definition B). RDKit finds one structure fewer, which has the SMILES ClFC(F)C(F)(F)Cl (OCID190000011511). This compound structure is actually an incorrect representation of 1,2-dichlorotetrafluoroethane, containing a hypervalent fluorine (see Fig. 2B). This structure was integrated into the OntoChem database of registered compounds when it was found in an early version of the Wikipedia Chemical infobox. 50 Meanwhile, this entry has been corrected in Wikipedia Chemistry but still remains as a legacy in the OntoChem compound registry system, waiting for relinking to the correct structure and respective OCID190005899464.
In general, the numbers of SMILES that are not accepted by the different toolkits as valid SMILES differ considerably (see "Invalid" entries in Table 1) and also depend on whether or not normalization is used. CDK seems to be more "forgiving" than RDKit and OCL, but only if normalization is used.
Of the 7 SMILES in CDK that are characterized as invalid SMILES representations with normalization, 6 are ferrocenes with coordinative bonds, such as [Fe].Cc1ccc(C)c1.Cc1ccc(C)c1 (OCID190071023137, see Fig. 2C). A meaningful ferrocene SMILES should have an iron with 2 positive charges and two cyclopentadienes with a negative charge, for example [Fe++].CC1=CC=C(C)[C-]1.CC2=CC=C(C)[C-]2 (see Fig. 2D); however, this "correct" SMILES does not truly reflect the aromatic structure with a distributed negative charge and its coordinative bonding nature. This problem will be seen for all coordinative compounds, as the current SMILES syntax does not allow for coordinative or hydrogen bonds as are available in the MDL MOL file version V3000 definitions. 51 This is a serious deficiency of the current SMILES notation, excluding most metal complexes from the universe of SMILES and InChI descriptions, and is a topic under discussion within the InChI committee and IUPAC. The 7th invalid SMILES was generated by OSRA, with the hypervalent carbon atoms as shown and discussed in Fig. 3A above (OCID190014261931).

For the 254 SMILES that were found to be invalid SMILES representations by OCL with normalization, all 254 contained an aromatic selenium atom "[se]" in a kekulized, non-aromatic SMILES string. In our opinion, this behaviour is correct, as there is no such thing as a single aromatic atom in a non-aromatic environment. However, this [se] is corrected to [Se] by the other toolkits at the normalization stage. In addition, the non-normalized OCL version finds 259 invalid SMILES: 254 are the same as for the normalized OCL, while the 5 additional SMILES include atoms with excessive charges such as [As+8], [As+9], [O+8], [O+9], [I+9], which are corrected to their uncharged forms by the normalizer, a behaviour which is likely undesirable. The invalid SMILES for CDK (7) and OCL (254) with normalization are the result of the initial SMILES parsing. The invalid SMILES from RDKit were not investigated further; however, these are provided in ref. 10 for further inspection. It is interesting to note that the number of PFAS compounds does not change when using OCL or RDKit, irrespective of whether normalization is applied or not. However, CDK clearly needs a structure normalization before performing substructure searching.

Mixed toolkit normalization and substructure searching on the CORE dataset
Table 2 presents the results of using different combinations of toolkits for normalization and subsequent substructure search. The first line per toolkit (two lines in the case of RDKit) repeats the results from Table 1, where the normalization and substructure search are performed by the same toolkit. As for Table 1, definition B was used as the PFAS query against the CORE dataset of 818 280 compounds.
For the CDK, while the combination of RDKit normalization and CDK substructure search does not appear to work well, the CDK substructure search works well with its own normalization as well as with OCL normalization. For the OCL results, it is interesting to note that the syntactically wrong SMILES with aromatic selenium mentioned above are corrected to non-aromatic by CDK, therefore reducing the number of invalid SMILES for the CDK + OCL combination. For the RDKit results, while the number of identified PFAS molecules was not influenced by the normalization used, the fewest invalid SMILES were found when using RDKit for both normalization and substructure search. Since the molVS model from RDKit returned fewer invalid entries but the same number of PFAS, this was used subsequently. Not surprisingly, Table 2 shows that it appears sensible to take normalization and substructure search from the same toolkit.

PFAS substructure search (definition B) on the patent dataset
Using the insights gained from Table 2, the larger, more heterogeneous SMILES data set of 4 182 712 SMILES from the patent extraction was investigated. The results of normalization and PFAS substructure search using the CDK, OCL and RDKit toolkits are shown in Table 3.
Inspecting the 36 invalid SMILES obtained for the CDK results revealed that all structures are ferrocene-type compounds, as already observed with the CORE dataset. Of the 263 invalid OCL SMILES, 237 were the already known problematic aromatic selenium compounds within a non-aromatic SMILES, 25 had problems with the assignment of aromatic bonds, while one SMILES contained an incorrect nitrogen notation "[N-13]". Again, it is interesting to note that the results from OCL and CDK are very close to each other. The invalid RDKit SMILES were too numerous for (detailed) further inspection, but are provided in ref. 10.

PFAS substructure search and effect of prior fingerprint selection
Tools that implement substructure searching for large chemical databases typically perform this task in two steps: first, fingerprints are generated and searched to obtain a list of candidate molecules for step two, a full graph-based search also known as atom-by-atom search (ABAS). The reason for this is that ABAS is an NP-complete problem and such searches can take quite some time, depending on the query structure. Thus, to achieve reasonable search results in a short time, the number of ABAS searches needs to be reduced to a minimum, which is achieved by a fast fingerprint compound pre-selection step. Fingerprints should therefore deliver a superset of compound candidates, which are then narrowed down by ABAS to the set of compounds that truly contain that substructure. The smaller the difference between this initial fingerprint list and the number of final compounds, the better and thus the more efficient the applied fingerprint algorithm. As a consequence, many fingerprint algorithms have been developed and optimized for preselection.
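The two-step principle can be sketched independently of any particular toolkit. In the minimal example below, a fingerprint is modelled as an integer bitset and the screening test is the usual subset condition (every bit set in the query must also be set in the molecule). The feature indices, the function names and the `abas` callable, which stands in for a real graph-based substructure match, are assumptions made for illustration and do not reproduce the Lucene-based implementation used in this study.

```python
def fingerprint(feature_indices):
    """Pack feature indices into an integer bitset (toy fingerprint)."""
    bits = 0
    for index in feature_indices:
        bits |= 1 << index
    return bits


def may_contain(query_fp, molecule_fp):
    """Screening test: all query bits must be present in the molecule."""
    return (query_fp & molecule_fp) == query_fp


def two_step_search(query_fp, fingerprints, abas):
    """Return (candidates, hits) for a query.

    fingerprints maps a molecule identifier to its bitset; abas is a
    callable taking an identifier and returning True for a real match.
    """
    # Step 1: cheap bitset screen yields a superset of the true hits.
    candidates = [
        mol_id for mol_id, fp in fingerprints.items()
        if may_contain(query_fp, fp)
    ]
    # Step 2: the expensive atom-by-atom search runs on candidates only.
    hits = [mol_id for mol_id in candidates if abas(mol_id)]
    return candidates, hits
```

The ratio of hits to candidates gives a simple measure of how selective a fingerprint is for a given query. Results only equal those of a full ABAS run if the screen never discards a true match.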
It is not the goal of this work to qualify and compare different fingerprint algorithms, since the described substructure search results were obtained with an ABAS on all compounds of interest (not only on a subset), as accurate results were the prime interest and search time was not an issue. However, a combined compound normalization + fingerprinting + substructure search process was also used to identify PFAS compounds from the extracted structures, as this method would probably be used in the future by typical chemistry database users to identify PFAS compounds. Table 4 shows the effect of fingerprint screening in substructure search for PFAS definitions A, B and C across the two compound datasets (CORE and Patents). It is interesting to note that the combined use of fingerprint selection and subsequent substructure search on the selected list resulted in quite comparable results for all the chemistry toolkits when using the higher quality CORE dataset. The number of identified PFAS is the same for CDK and OCL, and slightly lower for RDKit. The CDK fingerprint selection appears to be more efficient than the OCL or RDKit fingerprints for PFAS definitions A and B. For the stricter definition C, OCL fingerprints are most selective. The lower number of identified PFAS for the more heterogeneous patent SMILES dataset is not surprising, since more molecules are sorted out by the RDKit parser as shown in Table 4.
The results of PFAS selection with the combined use of fingerprints and subsequent ABAS selection correspond exactly to the results when using ABAS on all input molecules, with one exception: for CDK and definition A, the direct ABAS search finds one additional structure that the fingerprint + ABAS process misses, namely OCID190080191030 (PubChem CID 117959248), which has a very extensive polycyclic aromatic structure, shown in Fig. 2E.
Finalized PFAS CORE and patent lists via OCL
Since compound structures may be described by syntactically correct SMILES strings that nevertheless represent non-existing compounds, for example if they contain hypervalent atoms or non-existing isotopes (as discussed above), a final cleaning step was implemented based on the results above. Both input sets from CORE and Patents from above were used, along with the following procedure to derive a dataset of valid, normalized and standardized SMILES of PFAS-classified molecules according to the three definitions using the OCL toolkit:
Parsing the input SMILES and eliminating erroneous compound structures with hypervalent atoms or wrong isotopes.
Calculating the standard InChI of the input SMILES ("InChI-1").
Standardizing the parsed SMILES molecule object, writing a standardized SMILES and calculating the standard InChI of the standardized SMILES ("InChI-2").
De-duplicating structures based on "InChI-2".
Running an ABAS substructure query on the standardized SMILES for PFAS definitions A, B and C.
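The control flow of this cleaning procedure can be sketched as follows. The study itself used the OCL toolkit from Java; the Python sketch below only mirrors the order of the steps, and the callables `parse`, `standardize`, `write_smiles`, `to_inchi` and the entries of `matchers` are hypothetical placeholders for the corresponding toolkit functions. Structures whose InChI changes on standardization are set aside, in line with their removal from the final datasets.

```python
def clean_and_deduplicate(smiles_list, parse, standardize, write_smiles, to_inchi):
    """Return (kept, rejected, changed) for a collection of input SMILES.

    kept maps InChI-2 to one standardized SMILES; rejected lists inputs
    the parser refused; changed lists inputs where InChI-1 != InChI-2.
    """
    kept, rejected, changed = {}, [], []
    for smiles in smiles_list:
        mol = parse(smiles)                 # step 1: parse, drop invalid input
        if mol is None:
            rejected.append(smiles)
            continue
        inchi_1 = to_inchi(mol)             # step 2: InChI of the input
        std_smiles = write_smiles(standardize(mol))
        std_mol = parse(std_smiles)         # step 3: reparse standardized form
        if std_mol is None:
            rejected.append(smiles)
            continue
        inchi_2 = to_inchi(std_mol)
        if inchi_1 != inchi_2:
            changed.append(smiles)
            continue
        kept.setdefault(inchi_2, std_smiles)  # step 4: de-duplicate on InChI-2
    return kept, rejected, changed


def classify(kept, matchers):
    """Step 5: tag each kept structure with the definitions it satisfies."""
    return {
        inchi: [name for name, matches in matchers.items() if matches(smi)]
        for inchi, smi in kept.items()
    }
```

Passing the toolkit functions in as arguments keeps the sketch toolkit-neutral, so the same loop could in principle be run with different parsers to compare how many structures each one rejects or alters.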
In the CORE set, 974 structures were found with a wrong SMILES and 25 627 structures with a changed InChI after normalization using OCL; these were removed from the datasets. In the patent set, 108 492 structures had incorrect SMILES and 81 272 structures had a changed InChI after normalization with OCL.
The results of the normalized structures classified as PFAS are shown in Table 5 and compared with the existing PFASMASTER and OECDPFAS lists (mentioned in the introduction) by InChIKey. The number of entries missing from PubChem was determined by matching InChIKeys in each PFAS dataset and the OCID-PubChem dataset in sciwalker: sciwalker-opendata.chemistry_compounds.ocid_pubchem_cid.
The overlap of the PFAS in the CORE and patent datasets for the different definitions was (A) 12 876; (B) 1806; and (C) 866 PFAS entries, showing that the extraction of data from different sources reveals highly complementary results.
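Overlaps of this kind reduce to set intersections on the InChIKey. The sketch below is an illustrative assumption of how such a comparison could be written, not the code used to produce the reported numbers. It also allows matching on the first InChIKey block only (the 14-character connectivity hash), which ignores stereochemistry and isotope layers.

```python
def first_block(inchikey):
    """Return the first block of an InChIKey (connectivity skeleton)."""
    return inchikey.split("-")[0]


def overlap(keys_a, keys_b, first_block_only=False):
    """Return the set of keys (or first blocks) shared by two lists."""
    if first_block_only:
        return {first_block(k) for k in keys_a} & {first_block(k) for k in keys_b}
    return set(keys_a) & set(keys_b)


def missing_from(keys, reference_keys):
    """Return the keys that have no exact match in the reference list."""
    reference = set(reference_keys)
    return [k for k in keys if k not in reference]
```

Matching on the first block can only increase the number of matches relative to full-key matching, so comparing both counts shows how much of a gap is due to stereochemical or isotopic differences rather than to genuinely missing skeletons.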
The overlaps between the lists extracted here and the existing PFAS lists were much lower than expected. Likewise, more entries were missing from PubChem than originally expected, especially for the CORE database. The results were reality-checked, documented here with an example for the CORE set using the stringent definition C (915 compounds not in PubChem).
One of these 915 compounds is OCID190080091261 (InChIKey LZICQIXBOVBGMV-UHFFFAOYSA-N), shown in Fig. 4. This was published in a PhD thesis 52 in Chemistry and extracted from the document section IV. Experimental part 240 16.8.2 via name to structure from "Trimethyl({4'-[(7,7,8,8,9,9,10,10,11,11,12,12,12-tridecafluorododecyl)oxy]-1,1'-biphenyl-4-yl}ethynyl)silane", which has been interpreted correctly. This shows the potential for literature mining to capture structures that are real and worthy of further investigation, but not yet known to PFAS researchers or to large open databases such as PubChem.

To enhance the discovery of these PFAS in environmental samples, both datasets have been made available as CSV files 25 for use in mass spectrometry-based screening approaches, such as MetFrag 53 and patRoon. 54 Two separate files have been created, for the CORE and patent datasets respectively, with each entry tagged according to the PFAS definition that the given structure satisfies. The CORE dataset additionally includes the number of references in which the structure was found, which can be used for prioritization of candidate matches. The files were formatted as a MetFrag localCSV, where all entries that cause MetFrag to fail (formulas with digits preceding the carbon; certain unusual elements as removed in PubChemLite 55) were removed. Where available, names and CIDs were filled in via PubChem, otherwise the OCID was assigned as a name. The resulting files contained 26 695 entries for CORE (of which 5903 entries are without CIDs and 363 entries were removed from the original CORE list) and 1 778 470 entries for patents (of which 85 277 are without CIDs and 5181 entries were removed). The number of PubChem CIDs is higher than above due to the different style of querying; here a combination of FTP files (InChIKey to CID mapping) and REST API (SMILES to CID mapping for remaining entries without CIDs) was used, as the REST API offers the SMILES standardization to match with the final version in PubChem. For the original lists, 5937 CIDs were missing in the CORE set of 27 058 SMILES (21.9%), while 85 472 CIDs were missing in the patents set of 1 783 651 SMILES (4.8%). The ratio of missing CIDs was very similar in the final MetFrag files. Both datasets were deposited to PubChem (Feb. 12, 2022, submissions 112 615 and 112 624) to fill these gaps and the CID mappings were updated on April 20, 2022 to include these new CIDs. The MetFrag CSV files are available on Zenodo 25 for use in all mass spectrometry workflows, and are also available in the dropdown menu of the MetFrag Web interface (https://msbi.ipb-halle.de/MetFrag/).
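The bookkeeping behind these figures can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in the text; the variable and function names are illustrative.

```python
def percent(part, total):
    """Return part/total as a percentage."""
    return 100.0 * part / total


# Original lists: entries without a PubChem CID.
core_total, core_missing = 27058, 5937
patent_total, patent_missing = 1783651, 85472

# MetFrag files: entries removed because they cause MetFrag to fail.
core_removed, patent_removed = 363, 5181
core_metfrag = core_total - core_removed        # 26 695 entries
patent_metfrag = patent_total - patent_removed  # 1 778 470 entries

# Entries without CIDs in the final MetFrag files.
core_metfrag_missing, patent_metfrag_missing = 5903, 85277

ratios = {
    "CORE original": percent(core_missing, core_total),
    "patents original": percent(patent_missing, patent_total),
    "CORE MetFrag": percent(core_metfrag_missing, core_metfrag),
    "patents MetFrag": percent(patent_metfrag_missing, patent_metfrag),
}
```

Rounded to one decimal place, the original lists give 21.9% and 4.8%, and the MetFrag files give 22.1% and 4.8%, consistent with the statement that the ratio of missing CIDs barely changes after filtering.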

Comparison of CORE with OECDPFAS classication
Finally, the PFAS structures extracted from the CORE database were investigated using the OECDPFAS classification system via the PubChem Classification Browser 56 to determine whether particular PFAS classes were under- or over-represented in the extracted data sets compared with the entire OECDPFAS list. The CORE set of 27 058 InChIKeys was uploaded to the PubChem ID Exchange, 57 which returned 20 907 matches via Entrez History. This was then used to browse the NORMAN SLE Classification tree in PubChem. 56 Since the influence of searching via InChIKey first block (structural skeleton) versus full InChIKey was not dramatic (only an additional 44 entries found, see row 1 of Table 5), this analysis was kept at the InChIKey level for consistency with the rest of this article. The OECDPFAS list is split into many categories; of primary interest for data extraction is the "Structure Category", which covers 8 major PFAS categories (denoted 100 through 800), with several subcategories in each. The major categories and the number of matches in CORE are shown in Table 6.
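The difference between matching on the full InChIKey and on its first block can be illustrated with a short sketch. The first block is the 14-character segment before the first hyphen, which encodes the structural skeleton. The function names and the example reference key below are hypothetical and serve only to show the two matching modes; this is not the procedure used in PubChem.

```python
def first_block(inchikey: str) -> str:
    """Return the first block of an InChIKey (the part before the first hyphen)."""
    return inchikey.split("-")[0]


def count_matches(query_keys, reference_keys):
    """Count query entries matched by full InChIKey and by first block only."""
    ref_full = set(reference_keys)
    ref_skeleton = {first_block(key) for key in reference_keys}
    full = sum(1 for key in query_keys if key in ref_full)
    skeleton = sum(1 for key in query_keys if first_block(key) in ref_skeleton)
    return full, skeleton


# InChIKey of OCID190080091261 from the text
print(first_block("LZICQIXBOVBGMV-UHFFFAOYSA-N"))
```

Because every full-key match is also a first-block match, the first-block count is always at least as large as the full-key count; the difference between the two corresponds to the additional entries found by skeleton matching.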
Table 6 shows that PFAS in the categories 200, 300 and 600 are found quite well in the CORE documents (approx. 40% coverage). In contrast, categories 500 (per- and polyfluoroalkyl ether-based compounds) and 700 (semifluorinated perfluoroalkyl acid (PFAA) precursors) are underrepresented (16 and 12%, respectively). Even within categories, different subcategories were underrepresented; for instance, very few entries were found from subcategory 103 "other perfluoroalkyl carbonyl-based nonpolymers" (only 13 of 168 entries in OECDPFAS, i.e. 8%). Likewise, only 3 of 127 (2%) of subcategory 701.2 "Semi-fluorinated alkanes (SFAs) and derivatives (n ≥ 4)" were found, and only 26 of 405 (6%) of 705 "side-chain fluorinated aromatics". It would be interesting future work to investigate whether the CORE and patent datasets could capture additional knowledge to add more PFAS to these categories, for instance by expanding the "splitPFAS" work at categorizing PFAS 58 (prototyped so far on only 4 of the OECDPFAS categories) for this context.

a Prior to deposition of the entire dataset to PubChem, to fill these gaps.
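The subcategory coverage figures quoted above follow directly from the match counts. The short sketch below recomputes them; the helper name `coverage_percent` is hypothetical, and the counts are those given in the text.

```python
def coverage_percent(found: int, total: int) -> int:
    """Percentage of OECDPFAS entries in a (sub)category that were found in CORE."""
    return round(100 * found / total)


# (found in CORE, total in OECDPFAS) per subcategory, as quoted in the text
subcategories = {
    "103": (13, 168),
    "701.2": (3, 127),
    "705": (26, 405),
}

for code, (found, total) in subcategories.items():
    print(f"{code}: {found} of {total} ({coverage_percent(found, total)}%)")
```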
Fig. 4 A PFAS classified compound (all definitions) that was indexed in a CORE publication but is not in PubChem (OCID190080091261).

Conclusions
This article details methods to extract mentions of potential PFAS compounds and their structures as SMILES strings from scientific documents and patents, along with the use of three open access chemistry toolkits to identify PFAS structures in these compound lists by parsing, removing wrong structures, normalizing, standardizing and substructure searching these SMILES. Of the extracted mentions, FCC(F)(F)F [1,1,1,2-tetrafluoroethane] was the most frequently detected compound, found 6323 times overall in the CORE dataset. The resulting PFAS lists have been compiled, together with their references and chemical structures, using three different structural definitions of PFAS (A, B and C), where A is a very broad definition, B is a narrower definition and a subset of A, and C is a subset of B. These definitions came from the PFAS community, with A being recently proposed by the OECD, and both B and C deriving from definitions used by the US EPA. These definitions did not always contain sufficient cheminformatic detail to clarify certain edge cases, such as unsaturation or hybridization. As such, the results here are intended to contribute to the current debate surrounding the definition of PFAS and help further refine these definitions. The resulting PFAS lists have been compared with two of the largest publicly available lists of PFAS molecules, PFASMASTER from the US EPA and the OECDPFAS list, released by the OECD. The overlap between the lists and the data extracted from scientific documents and patents is lower than expected, showing that many molecules on these lists are not found in the scientific documents and patents investigated, while many molecules from the document extraction are also not found in the published PFAS lists. Several thousand were also not in PubChem, but have since been deposited. The CORE and Patents datasets have been provided as CSV files on Zenodo 25 for mass spectral screening. This information will add to the number of known potential PFAS substances and hopefully help contribute to alleviating the "PFAS knowledge gap". The provision of public datasets will allow the integration of this information into various non-target mass spectrometry workflows, such as the open workflows MetFrag 53 and patRoon, 54 thus enabling other researchers to investigate the potential occurrence of the identified PFAS compounds in humans and the environment in future studies.

The authors also thank Jane Frommer (Collabra) and the reviewers for their efforts and helpful comments.

Fig. 1 Schematic representation of the PFAS definitions A, B and C considered in this work. "AH" = hydrogen or any other atom; R1, R2, R3 represent any atom other than hydrogen.

Fig. 2 (A) The structure to test the validity of substructure search algorithms. (B) Erroneous SMILES, i.e. an incorrect representation of 1,2-dichlorotetrafluoroethane caught by RDKit. (C) Invalid SMILES representations of ferrocene-like compounds, caught by CDK. (D) "Correct" SMILES representation of ferrocene-like compounds, still demonstrating the limitation of SMILES in representing such compounds. (E) The structure captured by CDK with ABAS only, but not fingerprint (FP) + ABAS.
tion" can refer to different concepts in different contexts, they are used synonymously in this work. During normalization of SMILES, atomic charges and bond types may be changed. For example, a nitro group can be represented as either the charged form -[N+](=O)[O-] or the neutral form -N(=O)(=O), both yielding different but valid SMILES strings with the same InChI, i.e., InChI=1S/NO2/c2-1-3. Normalizing these two SMILES representations into a consensus SMILES facilitates further processing, e.g. for identity, similarity or substructure searching. Normalization of SMILES may flag alkali metals that are incorrectly connected to O or N, incorrect amide tautomers, and elements rendered as hypervalent or with abnormal valencies. For example, OCL flags and returns an error message when alkali metals are incorrectly covalently bonded to oxygen or nitrogen (e.g. NaO). The consensus representation is [Na+][O-]. Also, OCL flags and returns an error message when incorrect amide tautomers are parsed without a square bracket for the NH group (e.g., N=COH or HNC(=O) are incorrect representations of [NH]C(=

Table 1
Effect of normalization and toolkit selection on substructure search corresponding to PFAS definition B in the 818 280 compound CORE dataset

Table 2
Effect of different normalization procedures prior to substructure search (SSS) with various combinations of CDK, OCL and RDKit normalizers and subsequent substructure searches using PFAS definition B. Kekulization in CDK is turned off for non-CDK standardizers. The top row for each toolkit (indicated in bold; two rows for RDKit) is as given in Table 1

Table 3
Extracted PFAS from the 4 182 712 patent compound dataset using CDK, OCL and RDKit with PFAS definition B

Table 4
Efficacy of different fingerprints in pre-selection for substructure searching

Table 5
Finalized PFAS compound lists for the CORE and patent datasets according to definitions A, B and C, compared with the PFASMASTER and OECDPFAS (2021-12-11 versions). IKFB = InChIKey first block (structural skeleton)

Table 6
OECDPFAS list overlap with CORE according to structure category via the S25 OECDPFAS 6 list in the PubChem Classification Browser. 56 Neither mapping captures polymers, due to use of InChIKeys. PFAA = perfluoroalkyl acids. a