Open Access Article
Shadrack J. Barnabas,a Timo Böhme,a Stephen K. Boyer,b Matthias Irmer,a Christoph Ruttkies,a Ian Wetherbee,c Todor Kondić,d Emma L. Schymanski*d and Lutz Weber*a
aOntoChem GmbH, Blücherstrasse 24, 06120 Halle (Saale), Germany. E-mail: lutz.weber@ontochem.com
bCollabra Inc., San Jose, CA 95120, USA
cGoogle LLC, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA
dLuxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 6 avenue du Swing, L-4367 Belvaux, Luxembourg. E-mail: emma.schymanski@uni.lu
First published on 31st May 2022
The extraction of chemical information from documents is a demanding task in cheminformatics due to the variety of text and image-based representations of chemistry. The present work describes the extraction of chemical compounds with unique chemical structures from the open access CORE (COnnecting REpositories) and Google Patents full text document repositories. The importance of structure normalization is demonstrated using three open access cheminformatics toolkits: the Chemistry Development Kit (CDK), RDKit and OpenChemLib (OCL). Each toolkit was used for structure parsing, normalization and subsequent substructure searching, using SMILES as structure representations of chemical molecules and International Chemical Identifiers (InChIs) for comparison. Per- and polyfluoroalkyl substances (PFAS) were chosen as a case study to perform the substructure search, due to their high environmental relevance, their presence in both literature and patent corpuses, and the current lack of community consensus on their definition. Three different structural definitions of PFAS were chosen to highlight the implications of various definitions from a cheminformatics perspective. Since CDK, RDKit and OCL implement different criteria and methods for SMILES parsing and normalization, different numbers of parsed compounds were extracted, which were then evaluated using the three PFAS definitions. A comparison of these toolkits and definitions is provided, along with a discussion of the implications for PFAS screening and text mining efforts in cheminformatics. Finally, the extracted PFAS (∼1.7 M PFAS from patents and ∼27 K from CORE) were compared against various existing PFAS lists and are provided in various formats for further community research efforts.
Past efforts to identify and collect chemical structures of existing PFAS have resulted in several so-called “suspect” lists. The Organisation for Economic Co-operation and Development (OECD) released a PFAS list containing 4729 PFAS entities in 2017 (ref. 5 and 6) (hereafter “OECDPFAS”). The United States Environmental Protection Agency (EPA) “PFASMASTER” list currently (December 2021) contains 12 048 PFAS entries,7 merged from several PFAS lists on the EPA CompTox Chemicals Dashboard.8 Of these two lists, PFASMASTER contains 10 785 entries that can be represented by an International Chemical Identifier (InChI), while the OECDPFAS list contains 3741 entries with an InChI, using versions downloaded from the EPA website on 2021-12-11 (ref. 7 and 9) and provided in ref. 10. The other entities in the lists are substances without a clear composition, or with known composition that cannot be represented fully with an InChI. Of the 3741 OECD compounds with an InChI, 3731 are also contained in the PFASMASTER list (by matching InChI).
These lists and more are used in environmental assessments to gauge the extent of the “PFAS knowledge gap”. Such lists serve additional purposes, e.g., to search for the respective compounds in analytical data of environmental samples.11 The majority of PFAS suspect lists are hand curated, painstakingly compiled by experts and thus limited both by access to relevant information and by the manual nature of the efforts. Since the current definition of PFAS is strongly debated by the community, three different structural definitions of PFAS in use have been considered in this case study, clarified below and shown in Fig. 1:
Fig. 1 Schematic representation of the PFAS definitions A, B and C considered in this work. “AH” = hydrogen or any other atom; R1, R2, R3 represent any atom other than hydrogen.
Extracting chemical information from text documents is a challenging task. Unlike other natural language terms, chemistry-related terms pose additional challenges: the number of known chemical compounds with unique structures is not only very high (e.g. PubChem16 currently contains 111 M unique compounds, which is only a tiny fraction of the estimated chemical space), but each compound may also appear in text documents under a multiplicity of names. Examples include trivial names (e.g. perfluorooctanesulfonic acid, PFOS), International Union of Pure and Applied Chemistry (IUPAC) names (e.g. 1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8-heptadecafluorooctane-1-sulfonic acid), mixtures of trivial and IUPAC naming, enumerations of Markush17 structures, trade names and half formulas (e.g. Krytox oils, F–(CF(CF3)–CF2–O)n–CF2CF3 where n = 10–60), database identifiers such as Chemical Abstract Service (CAS) registry numbers (e.g. 1763-23-1), PubChem Compound Identifiers (CIDs, e.g. 74483), and even images that are referenced in the text with simple numeric labels. Advanced and flexible methods are required to capture all types of chemical information, with subsequent cheminformatic manipulation to ensure correct mapping to detailed structural information.
The automated analysis of the increasing number of accessible scientific documents may provide input to fuel scientific studies to identify novel molecules with potentially desired or undesired properties. OC|processor18 is a modular semantic annotation toolkit, based on Apache UIMA.19 It is designed to annotate different document types such as PDF, images, HTML, XML, MS Office and plain text documents. It uses a range of established dictionaries and ontologies as well as rule-based algorithms to annotate and index scientific named entities such as diseases, genes, species and chemistry. The properties of concept synonyms as well as the hierarchy of ontological concepts are taken into account to provide more accurate context-sensitive annotation. For example, the term “sting” could be annotated as a known musician, a species, a disease or a protein. OC|processor disambiguates based on the term environment and the presence of related concepts, assigning the annotation/knowledge domain with the highest confidence value. The precision and recall of OC|processor have been detailed elsewhere.20 For this study, the growing open access document repository CORE21,22 (COnnecting REpositories) and the patent full text documents in Google Patents23 were selected to demonstrate the automated capability of identifying and analyzing scientific entities, applied to the case study of potential PFAS in documents. OC|processor18 was used to automatically identify and extract mentions of chemical compounds from patents and other open access scientific documents such as scientific articles and university documents in CORE. The resulting collection of diverse chemical compounds was subsequently filtered for small molecule compounds for which a unique InChI24 could be generated, thus removing incompletely defined structures such as substances and polymers, as well as mentions of chemical class terms and Markush-like17 structures.
Of the three definitions presented above, definition B was used for most of the detailed investigations in this study. The final PFAS lists are available for all 3 definition versions described above and have been made public, together with additional results, in various formats10,25 (see also data availability) for general assessment and as input for future studies.
963 421 de-duplicated documents were selected and downloaded from the CORE document set of open access documents.22 These documents, when annotated with OC|processor, resulted in the annotation of 818 280 compounds with a unique InChI.31 The SMILES extracted from CORE are from the text only; images were not extracted.
730 728 Google Patent documents semantically annotated with OC|processor in May 2021 using both the text and images found in these patents was used. The resulting annotations are available in a BigQuery table32 dated May 13, 2021 (see BigQuery32 patents-public-data in the google_patents_research dataset and table annotations_202105). In total, 51 928 230 588 annotations were found. Of those, 4 533 988 229 were compound annotations with associated SMILES and InChI. Of these 4.5 billion annotations, 18 032 261 had a unique InChI33 and respective Ontology Concept IDentifier (OCID)34 in the SciWalker-Open-Data project.35 As a next (pre-filtering) step, the 18 032 261 unique compounds from the chemistry annotations of patents were reduced to a dataset of 4 182 712 SMILES that contained an “F” character, resulting from a fluorine, iron or francium atom.
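The pre-filter described above is a plain character test, which is why iron and francium compounds also pass it. A minimal sketch of such a filter, assuming Python and only the standard library; the function names and the regular expression are illustrative and are not taken from the code used in this work:

```python
import re

# "F" not followed by e, l, m or r: this excludes the two-letter symbols
# Fe, Fl, Fm and Fr, but keeps fluorine bonded to an aromatic atom,
# as in "Fc1ccccc1".
FLUORINE = re.compile(r"F(?![elmr])")


def has_f_character(smiles: str) -> bool:
    """Coarse pre-filter: keep any SMILES that contains the character 'F'."""
    return "F" in smiles


def has_fluorine_atom(smiles: str) -> bool:
    """Stricter text test: an 'F' that is not part of a two-letter symbol."""
    return FLUORINE.search(smiles) is not None


candidates = [
    "OC(=O)C(F)(F)F",                # trifluoroacetic acid
    "[Fe].Cc1ccc(C)c1.Cc1ccc(C)c1",  # an iron-containing SMILES
    "CCO",                           # ethanol, no 'F' at all
]
prefiltered = [s for s in candidates if has_f_character(s)]
fluorinated = [s for s in prefiltered if has_fluorine_atom(s)]
```

The coarse test is cheap and loses no fluorinated compound; the stricter test only illustrates how the non-fluorine matches could be separated out before any toolkit parsing.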
The quality of the chemistry-related annotations from the combined text and image patent data is lower than from the CORE set. Optical structure recognition and extraction from images often leads to erroneous structures such as compounds containing hypervalent atoms or wrong isotopes that arise from poor image quality.
• RDKit: with the two available standardizers – molVS40,41 and rdMol.
• CDK: via SMILES parsing, normalizing the SMILES with the kekulize option.
• OCL: via SMILES parsing and MoleculeStandardizer, writing the SMILES in a kekulized form.
After parsing the input SMILES, the resulting molecule object was again represented as SMILES as an intermediate step before parsing it again and performing the substructure search to classify it as a PFAS or non-PFAS. This procedure has an effect on the parsing results as described below; in a production environment this additional SMILES generation step would probably not be performed.
The SMILES query definitions C(F)(F), C(F)(F)C(F) and C(*)(F)(F)C(F)(*)(*) (PFAS definitions A, B and C, respectively) were used as substructure search queries to determine the number of unique PFAS compounds.
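To make the three query patterns concrete, the following sketch checks them on a hand-built heavy-atom graph instead of calling CDK, RDKit or OCL. It is a simplified stand-in under stated assumptions: hydrogens are implicit, bond orders and aromaticity are ignored (a toolkit would match the query bonds as single bonds), and the `Molecule` representation and function names are hypothetical.

```python
from typing import Dict, List, Tuple

# (element symbols, bonds as index pairs); hydrogens are implicit
Molecule = Tuple[List[str], List[Tuple[int, int]]]


def neighbours(mol: Molecule) -> Dict[int, List[int]]:
    """Build a heavy-atom adjacency list from the bond list."""
    atoms, bonds = mol
    adj: Dict[int, List[int]] = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def classify(mol: Molecule) -> Tuple[bool, bool, bool]:
    """Return (A, B, C) flags for the three query patterns."""
    atoms, _ = mol
    adj = neighbours(mol)

    def n_fluorine(i: int) -> int:
        return sum(1 for j in adj[i] if atoms[j] == "F")

    carbons = [i for i, el in enumerate(atoms) if el == "C"]
    # A: C(F)(F), a carbon carrying at least two fluorine atoms
    cf2 = [i for i in carbons if n_fluorine(i) >= 2]
    # B: C(F)(F)C(F), a CF2 carbon bonded to a carbon with at least one F
    pairs = [(i, j) for i in cf2 for j in adj[i]
             if atoms[j] == "C" and n_fluorine(j) >= 1]
    # C: C(*)(F)(F)C(F)(*)(*), as B, but both carbons need four explicit
    # (non-hydrogen) neighbours
    is_c = any(len(adj[i]) >= 4 and len(adj[j]) >= 4 for i, j in pairs)
    return bool(cf2), bool(pairs), is_c


# CF3-COOH, CF3-CF2-COOH and CHF2-CH2F as heavy-atom graphs
tfa = (["C", "F", "F", "F", "C", "O", "O"],
       [(0, 1), (0, 2), (0, 3), (0, 4), (4, 5), (4, 6)])
pfpra = (["C", "F", "F", "F", "C", "F", "F", "C", "O", "O"],
         [(0, 1), (0, 2), (0, 3), (0, 4), (4, 5), (4, 6), (4, 7),
          (7, 8), (7, 9)])
tfe = (["C", "F", "F", "C", "F"], [(0, 1), (0, 2), (0, 3), (3, 4)])
```

In this sketch trifluoroacetic acid satisfies only the first pattern, 1,1,2-trifluoroethane the first two, and pentafluoropropionic acid all three, which mirrors the increasing strictness of the query strings.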
Table 1 Parsing and PFAS definition B substructure search results, without and with normalization, for the 818 280 compound CORE dataset

| Toolkit | Normalizer | No normalization: True | No normalization: False | No normalization: Invalid | With normalization: True | With normalization: False | With normalization: Invalid |
|---|---|---|---|---|---|---|---|
| CDK | Built-in | 4163 | 801 624 | 12 493 | 4192 | 814 081 | 7 |
| OCL | Standardizer | 4192 | 813 829 | 259 | 4192 | 813 834 | 254 |
| RDKit | molVS | 4191 | 813 463 | 626 | 4191 | 813 462 | 627 |
| RDKit | rdMol | 4191 | 813 463 | 626 | 4191 | 813 090 | 999 |
The number of molecules rejected during SMILES parsing differs considerably between the toolkits. A rejected SMILES cannot be used for subsequent substructure search, potentially reducing the number of identified PFAS molecules. Thus, the quality of the different SMILES parsers was checked by first parsing the input SMILES, then generating the corresponding InChI from the molecule object. In a second step, a normalized SMILES was written from the molecule object and parsed again, and the InChI of this “reparsed” SMILES was calculated. Discrepancies between the InChIs from step one and step two of this procedure reveal issues in the quality of the parsing.
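The two-step check can be sketched independently of any particular toolkit. In the sketch below, `parse`, `to_inchi` and `to_smiles` are hypothetical callables that would be supplied by CDK, RDKit or OCL bindings; the toy stand-ins at the end exist only so that the example runs without a cheminformatics library and do not produce real InChIs.

```python
from typing import Callable, Optional


def roundtrip_consistent(
    smiles: str,
    parse: Callable[[str], Optional[object]],
    to_inchi: Callable[[object], str],
    to_smiles: Callable[[object], str],
) -> Optional[bool]:
    """Compare the InChI before and after a write/re-parse cycle.

    Returns None if either parse fails, True if both InChIs agree and
    False if the re-parsed structure yields a different InChI.
    """
    mol = parse(smiles)
    if mol is None:
        return None
    inchi_1 = to_inchi(mol)           # step one: InChI of the parsed input
    reparsed = parse(to_smiles(mol))  # step two: write, then parse again
    if reparsed is None:
        return None
    return inchi_1 == to_inchi(reparsed)


# Toy stand-ins (assumptions for illustration only).
def toy_parse(s: str) -> Optional[str]:
    return None if "?" in s else s


def toy_inchi(mol: str) -> str:
    return "toy-inchi/" + mol


def toy_write_faithful(mol: str) -> str:
    return mol


def toy_write_lossy(mol: str) -> str:
    return mol.replace("@", "")  # silently drops stereo marks
```

With real toolkit functions plugged in, a `False` result would correspond to the InChI discrepancies between step one and step two described above, and `None` to a rejected SMILES.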
COH or HNC(=O) are incorrect representations of [NH]C(=O)). Since each chemistry toolkit uses somewhat different rules to normalize SMILES, this has an effect on the outcomes of the PFAS substructure search described below. Some normalization tasks may also be performed by specific “standardizer” modules of the toolkits that use rules (with varying degrees of available documentation) to transform SMILES into a normalized form.
In general, the numbers of SMILES that are not accepted as valid by the different toolkits differ considerably (see “Invalid” entries in Table 1) and also depend on whether or not normalization is used. CDK seems to be more “forgiving” than RDKit and OCL, but only if normalization is used.
Of the 7 SMILES in CDK that are characterized as invalid SMILES representations with normalization, 6 are ferrocenes with coordinative bonds, such as [Fe].Cc1ccc(C)c1.Cc1ccc(C)c1 (OCID190071023137, see Fig. 2C). A meaningful ferrocene SMILES should have an iron with two positive charges and two cyclopentadienes each with a negative charge, for example [Fe++].CC1=CC=C(C)[C-]1.CC2=CC=C(C)[C-]2 (see Fig. 2D); however, this “correct” SMILES does not truly reflect the aromatic structure with a distributed negative charge and its coordinative bonding nature. This problem will be seen for all coordinative compounds, as the current SMILES syntax does not allow for coordinative or hydrogen bonds as are available in the MDL MOL file version V3000 definitions.51 This is a serious deficiency of the current SMILES notation, excluding most metal complexes from the universe of SMILES and InChI descriptions, and is a topic under discussion within the InChI committee and IUPAC. The 7th invalid SMILES was generated by OSRA, with the hypervalent carbon atoms as shown and discussed in Fig. 3A above (OCID190014261931).
Of the 254 SMILES that were found to be invalid SMILES representations by OCL with normalization, all contained an aromatic selenium atom “[se]” in a kekulized, non-aromatic SMILES string. In our opinion, this behaviour is correct, as there is no such thing as a single aromatic atom in a non-aromatic environment. However, this [se] is corrected to [Se] by the other toolkits at the normalization stage. In addition, the non-normalized OCL version finds 259 invalid SMILES: 254 are the same as for the normalized OCL, while the 5 additional SMILES include atoms with excessive charges such as [As+8], [As+9], [O+8], [O+9], [I+9], which are corrected to their uncharged forms by the normalizer – a behaviour which is likely undesirable. The invalid SMILES for CDK (7) and OCL (254) with normalization are the result of the initial SMILES parsing. The invalid SMILES from RDKit were not investigated further; however, these are provided in ref. 10 for further inspection. It is interesting to note that the number of PFAS compounds does not change when using OCL or RDKit, irrespective of whether normalization is applied or not. However, CDK clearly needs a structure normalization before performing substructure searching.
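Excessive charges of the kind listed above can be flagged with a simple text scan before any toolkit is involved. The sketch below is one possible way to do this with Python regular expressions; the threshold of 4 and all names are assumptions made for illustration, not values used in this work.

```python
import re
from typing import List

BRACKET = re.compile(r"\[([^\]]+)\]")  # content of each bracket atom
CHARGE = re.compile(r"([+-]+)(\d*)")   # "+", "++", "+8", "-13", ...


def formal_charge(bracket_content: str) -> int:
    """Read the formal charge from the inside of a bracket atom."""
    match = CHARGE.search(bracket_content)
    if match is None:
        return 0
    signs, digits = match.groups()
    magnitude = int(digits) if digits else len(signs)
    return magnitude if signs[0] == "+" else -magnitude


def suspicious_atoms(smiles: str, max_abs_charge: int = 4) -> List[str]:
    """Return bracket atoms whose absolute charge exceeds the threshold."""
    return [
        "[" + content + "]"
        for content in BRACKET.findall(smiles)
        if abs(formal_charge(content)) > max_abs_charge
    ]
```

For example, `suspicious_atoms("C[As+8](C)C")` flags `[As+8]`, while `[Fe++]` passes with a charge of 2. Flagging such atoms, instead of silently resetting the charge, keeps the decision with the curator.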
Table 2 Combinations of substructure search (SSS) toolkit and standardizer for the 818 280 CORE compound dataset.

| SSS | Standardizer | True | False | Invalid |
|---|---|---|---|---|
| CDK | CDK normalizer | 4192 | 814 081 | 7 |
| CDK | OCL standardizer | 4192 | 813 834 | 256 |
| CDK | RDKit standardizer molVS | 3018 | 266 657 | 548 605 |
| CDK | RDKit standardizer rdMol | 3018 | 266 862 | 548 400 |
| OCL | OCL standardizer | 4192 | 813 834 | 254 |
| OCL | CDK normalizer | 4192 | 814 072 | 16 |
| OCL | RDKit standardizer molVS | 4191 | 813 220 | 869 |
| OCL | RDKit standardizer rdMol | 4191 | 813 220 | 869 |
| RDKit | RDKit standardizer molVS | 4191 | 813 462 | 627 |
| RDKit | RDKit standardizer rdMol | 4191 | 813 090 | 999 |
| RDKit | OCL standardizer | 4191 | 813 051 | 1038 |
| RDKit | CDK normalizer | 4191 | 813 453 | 636 |
For the CDK, the combination of RDKit normalization and CDK substructure search does not appear to work well, whereas the CDK substructure search works well with its own normalizer as well as with OCL normalization. For the OCL results, it is interesting to note that the syntactically wrong SMILES with aromatic selenium mentioned above are corrected to non-aromatic by CDK, therefore reducing the number of invalid SMILES for the CDK + OCL combination. For the RDKit results, while the number of identified PFAS molecules was not influenced by the normalization used, the fewest invalid SMILES were found when using RDKit for both normalization and substructure search. Since the molVS standardizer from RDKit returned fewer invalid entries but the same number of PFAS, it was used subsequently. Not surprisingly, Table 2 suggests that it is advisable to take normalization and substructure search from the same toolkit.
The dataset of 4 182 712 SMILES from the patent extraction was investigated. The results of normalization and PFAS substructure search using the CDK, OCL and RDKit toolkits are shown in Table 3.
Table 3 Normalization and substructure search (SSS) results for the 4 182 712 patent compound dataset using CDK, OCL and RDKit with PFAS definition B

| SSS | Standardizer | True | False | Invalid |
|---|---|---|---|---|
| CDK | CDK normalizer | 78 412 | 4 104 264 | 36 |
| OCL | OCL standardizer | 78 411 | 4 104 038 | 263 |
| RDKit | molVS | 75 762 | 3 988 584 | 118 366 |
Inspecting the invalid 36 SMILES obtained for the CDK results revealed that all structures are ferrocene type compounds as already observed with the CORE dataset. Of the 263 invalid OCL SMILES, 237 were the already known problematic aromatic selenium compounds within a non-aromatic SMILES, 25 had problems with the assignment of aromatic bonds, while one SMILES contained an incorrect nitrogen notation “[N-13]”. Again, it is interesting to note that the results from OCL and CDK are very close to each other. The invalid RDKit SMILES were too numerous for (detailed) further inspection, but are provided in ref. 10.
It is not the goal of this work to qualify and compare different fingerprint algorithms, since the described substructure search results were obtained with an ABAS on all compounds of interest (not only on a subset), as accurate results were the prime interest and search time was not an issue. However, a combined compound normalization + fingerprinting + substructure search process was also used to identify PFAS compounds from the extracted structures, as this method would probably be used in the future by typical chemistry database users to identify PFAS compounds. Table 4 shows the effect of fingerprint screening in substructure search for PFAS definitions A, B and C across the two compound datasets (CORE and Patents). It is interesting to note that the combined use of fingerprint selection and subsequent substructure search on the selected list gave quite comparable results for all the chemistry toolkits when using the higher quality CORE dataset. The number of identified PFAS is the same for CDK and OCL, and slightly lower for RDKit. The CDK fingerprint selection appears to be more efficient than the OCL or RDKit fingerprints for PFAS definitions A and B. For the stricter definition C, OCL fingerprints are the most selective. The lower number of PFAS identified by RDKit in the more heterogeneous patent SMILES dataset is not surprising, since more molecules are sorted out by the RDKit parser as shown in Table 4.
Table 4 PFAS hits after fingerprint (FP) screening alone and after FP screening followed by ABAS (FP + ABAS), for PFAS definitions A, B and C

| Definition | Toolkit | CORE dataset (818 280 compounds): FP | CORE dataset: FP + ABAS | Patent dataset (4 182 712 compounds): FP | Patent dataset: FP + ABAS |
|---|---|---|---|---|---|
| A | OCL | 58 132 | 27 287 | 4 044 452 | 1 844 193 |
| A | CDK | 45 632 | 27 287 | 2 658 045 | 1 844 254 |
| A | RDKit | 300 848 | 27 282 | 4 047 047 | 1 792 598 |
| B | OCL | 23 830 | 4192 | 2 225 142 | 78 411 |
| B | CDK | 16 922 | 4192 | 1 335 409 | 78 412 |
| B | RDKit | 299 969 | 4191 | 4 041 432 | 75 762 |
| C | OCL | 9043 | 3507 | 472 731 | 62 553 |
| C | CDK | 16 922 | 3507 | 1 335 409 | 62 561 |
| C | RDKit | 215 514 | 3502 | 3 502 138 | 60 426 |
The results of PFAS selection with the combined use of fingerprints and subsequent ABAS selection correspond exactly to the results when using ABAS on all input molecules – with one exception of CDK for definition A where the direct ABAS search finds one structure in addition to the fingerprint + ABAS process, which is OCID190080191030 (PubChem CID 117959248) with a very extensive polycyclic aromatic structure, shown in Fig. 2E.
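The two-stage process of fingerprint screening followed by ABAS can be sketched as follows. This is an illustrative toy and not the fingerprint of CDK, OCL or RDKit: features are plain strings hashed into a 64-bit mask, and the exact search is represented by a caller-supplied predicate (here a precomputed, hypothetical flag `cf2_cf`).

```python
import zlib
from typing import Callable, Dict, Iterable, List, Tuple


def fingerprint(features: Iterable[str], n_bits: int = 64) -> int:
    """Fold a set of feature strings into an n_bits-wide bit mask."""
    fp = 0
    for feature in features:
        fp |= 1 << (zlib.crc32(feature.encode("utf-8")) % n_bits)
    return fp


def screen_then_search(
    query_features: Iterable[str],
    library: List[Dict],
    exact_match: Callable[[Dict], bool],
) -> Tuple[List[Dict], List[Dict]]:
    """Stage 1: keep molecules whose bits cover the query bits.
    Stage 2: run the expensive exact check on the survivors only."""
    query_fp = fingerprint(query_features)
    candidates = [
        m for m in library
        if (fingerprint(m["features"]) & query_fp) == query_fp
    ]
    hits = [m for m in candidates if exact_match(m)]
    return candidates, hits


library = [
    {"name": "pentafluoropropionic acid",
     "features": {"C-F", "C-C", "C=O"}, "cf2_cf": True},
    {"name": "fluoroethane", "features": {"C-F", "C-C"}, "cf2_cf": False},
    {"name": "ethanol", "features": {"C-C", "C-O"}, "cf2_cf": False},
]
candidates, hits = screen_then_search(
    {"C-F", "C-C"}, library, lambda m: m["cf2_cf"]
)
```

A molecule containing all query features always passes stage 1, so a screen of this kind should add false positives (removed in stage 2) but no false negatives. This is consistent with the FP counts in Table 4 being larger than the FP + ABAS counts.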
• Parsing the input SMILES and eliminating erroneous compound structures with hypervalent atoms or wrong isotopes
• Calculating the standard InChI of the input SMILES (“InChI-1”)
• Standardizing the parsed SMILES molecule object, writing a standardized SMILES and calculating the standard InChI of the standardized SMILES (“InChI-2”)
• De-duplicating structures based on “InChI-2”
• Running an ABAS substructure query on the standardized SMILES for PFAS definitions A, B and C.
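One possible reading of the filtering and de-duplication steps listed above is sketched below. It assumes that “InChI-1” and “InChI-2” have already been computed by a toolkit and stored per record, with `None` marking a structure that could not be parsed; the record layout and the placeholder identifier strings are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def clean_and_deduplicate(records: List[Record]) -> Tuple[List[Record], int, int]:
    """Drop unparsable and InChI-changing records, then de-duplicate.

    Returns (kept records, number invalid, number with changed InChI).
    """
    kept: Dict[str, Record] = {}
    n_invalid = 0
    n_changed = 0
    for rec in records:
        inchi_1, inchi_2 = rec["inchi_1"], rec["inchi_2"]
        if inchi_1 is None or inchi_2 is None:
            n_invalid += 1           # SMILES could not be parsed
        elif inchi_1 != inchi_2:
            n_changed += 1           # standardization altered the InChI
        else:
            kept.setdefault(inchi_2, rec)  # first record per InChI-2 wins
    return list(kept.values()), n_invalid, n_changed


# Placeholder identifiers, not real InChI strings.
records = [
    {"smiles": "CCO", "inchi_1": "id-ethanol", "inchi_2": "id-ethanol"},
    {"smiles": "OCC", "inchi_1": "id-ethanol", "inchi_2": "id-ethanol"},
    {"smiles": "C", "inchi_1": "id-methane", "inchi_2": "id-methane"},
    {"smiles": "C(", "inchi_1": None, "inchi_2": None},
    {"smiles": "X", "inchi_1": "id-before", "inchi_2": "id-after"},
]
unique, n_invalid, n_changed = clean_and_deduplicate(records)
```

The PFAS substructure queries would then be run only on the records returned in `unique`.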
In the CORE set 974 structures were found with a wrong SMILES and 25 627 structures with a changed InChI after normalization using OCL – these were removed from the datasets. In the patent set, 108 492 structures had incorrect SMILES and 81 272 structures had a changed InChI after normalization with OCL.
The results of the normalized structures classified as PFAS are shown in Table 5 and compared with the existing PFASMASTER and OECDPFAS lists (mentioned in the introduction) by InChIKey. The number of entries missing from PubChem was determined by matching InChIKeys in each PFAS dataset and the OCID-PubChem dataset in sciwalker: sciwalker-open-data.chemistry_compounds.ocid_pubchem_cid.
Table 5 PFAS extracted from the CORE and patent datasets compared with PFASMASTER, OECDPFAS and PubChem (IKFB = InChIKey first block)

| Dataset | Total | Not found in PFASMASTER (10 782 InChI) | Found in PFASMASTER (10 782 InChI) | Found in OECDPFAS (3741 InChI) | Not found in PubChemᵃ |
|---|---|---|---|---|---|
| CORE definition A | 27 058 | 25 446 | 1612 (1686 IKFB) | 944 (988 IKFB) | 7119 |
| CORE definition B | 4139 | 2652 | 1487 | 939 | 1175 |
| CORE definition C | 3457 | 2095 | 1362 | 931 | 915 |
| Patents definition A | 1 783 651 | 1 780 041 | 3610 | 1529 | 216 777 |
| Patents definition B | 75 108 | 71 818 | 3290 | 1520 | 10 809 |
| Patents definition C | 34 197 | 32 564 | 1633 | 847 | 4882 |

ᵃ Prior to deposition of the entire dataset to PubChem, to fill these gaps.
The overlaps of the PFAS in the CORE and patent datasets for the different definitions were (A) 12 876; (B) 1806; and (C) 866 PFAS entries, showing that the extraction of data from different sources reveals highly complementary results.
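List overlaps of this kind reduce to set intersections on InChIKeys, either on the full key or on its first block (the 14-character structural skeleton). A minimal sketch; apart from the InChIKey quoted in this article, the keys below are made-up placeholders that only follow the InChIKey layout.

```python
from typing import Iterable, Set


def first_block(inchikey: str) -> str:
    """Return the first (skeleton) block of an InChIKey."""
    return inchikey.split("-")[0]


def overlap(keys_a: Iterable[str], keys_b: Iterable[str],
            by_first_block: bool = False) -> Set[str]:
    """Intersect two key lists on the full key or on the first block."""
    if by_first_block:
        return ({first_block(k) for k in keys_a}
                & {first_block(k) for k in keys_b})
    return set(keys_a) & set(keys_b)


extracted = [
    "LZICQIXBOVBGMV-UHFFFAOYSA-N",  # key quoted in this article
    "BBBBBBBBBBBBBB-UHFFFAOYSA-N",  # placeholder
    "CCCCCCCCCCCCCC-UHFFFAOYSA-N",  # placeholder
]
reference = [
    "LZICQIXBOVBGMV-UHFFFAOYSA-N",
    "BBBBBBBBBBBBBB-ZZZZZZZZSA-N",  # placeholder, same skeleton as above
]
full_matches = overlap(extracted, reference)
skeleton_matches = overlap(extracted, reference, by_first_block=True)
```

Matching on the first block is the more permissive of the two, which is why the IKFB counts in Table 5 are slightly higher than the full-key counts.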
The overlaps between the lists extracted here and the existing PFAS lists were much lower than expected. Likewise, more entries were missing from PubChem than originally expected, especially for the CORE database. The results were reality checked, documented here with an example from the CORE set using the stringent definition C (915 compounds not in PubChem). One of these 915 compounds is OCID190080091261 (InChIKey LZICQIXBOVBGMV-UHFFFAOYSA-N), shown in Fig. 4. This was published in a PhD thesis52 in Chemistry and extracted from the document section IV. Experimental part 240 16.8.2 via name to structure from “Trimethyl({4′-[(7,7,8,8,9,9,10,10,11,11,12,12,12-tridecafluorododecyl)oxy]-1,1′-biphenyl-4-yl}ethynyl)silane”, which has been interpreted correctly. This shows the potential for literature mining to capture structures that are real and worthy of further investigation, but not yet known to PFAS researchers or to large open databases such as PubChem.
Fig. 4 A PFAS classified compound (all definitions) that was indexed in a CORE publication but is not in PubChem (OCID190080091261).
To enhance the discovery of these PFAS in environmental samples, both datasets have been made available as CSV files25 for use in mass spectrometry-based screening approaches, such as MetFrag53 and patRoon.54 Two separate files have been created, for the CORE and patent datasets respectively – with each entry tagged according to the PFAS definition that the given structure satisfies. The CORE dataset additionally includes the number of references in which the structure was found, which can be used for prioritization of candidate matches. The files were formatted as a MetFrag localCSV, where all entries that cause MetFrag to fail (formulas with digits preceding the carbon; certain unusual elements as removed in PubChemLite55) were removed. Where available, names and CIDs were filled in via PubChem, otherwise the OCID was assigned as a name. The resulting files contained 26 695 entries for CORE (of which 5903 entries are without CIDs and 363 entries were removed from the original CORE list) and 1 778 470 entries for patents (of which 85 277 are without CIDs and 5181 entries were removed). The number of PubChem CIDs is higher than above due to the different style of querying; here a combination of FTP files (InChIKey to CID mapping) and REST API (SMILES to CID mapping for remaining entries without CIDs) was used, as the REST API offers the SMILES standardization to match with the final version in PubChem. For the original lists, 5937 CIDs were missing in the CORE set of 27 058 SMILES (21.9%), while 85 472 CIDs were missing in the patents set of 1 783 651 SMILES (4.8%). The ratio of missing CIDs was very similar in the final MetFrag files. Both datasets were deposited to PubChem (Feb. 12, 2022, submissions 112 615 and 112 624) to fill these gaps and the CID mappings were updated on April 20, 2022 to include these new CIDs. The MetFrag CSV files are available on Zenodo25 for use in all mass spectrometry workflows, and are also available in the dropdown menu of the MetFrag Web interface (https://msbi.ipb-halle.de/MetFrag/).
The CORE set of 27 058 InChIKeys was uploaded to the PubChem ID Exchange,57 which returned 20 907 matches via Entrez History. This was then used to browse the NORMAN SLE Classification tree in PubChem.56 Since the influence of searching via InChIKey first block (structural skeleton) versus full InChIKey was not dramatic (only an additional 44 entries found, see row 1 of Table 5), this analysis was kept at the InChIKey level for consistency with the rest of this article. The OECDPFAS list is split into many categories; of primary interest for data extraction is the “Structure Category”, which covers 8 major PFAS categories (denoted 100 through 800), with several subcategories in each. The major categories and the number of matches in CORE are shown in Table 6.
Table 6 OECDPFAS structure categories and the number of matches in CORE

| OECD structure category | Total | In CORE | Ratio |
|---|---|---|---|
| S25\|OECDPFAS\|list of PFAS from the OECD | 3677 | 940 | 26% |
| 100 perfluoroalkyl carbonyl compounds | 490 | 126 | 26% |
| 200 perfluoroalkane sulfonyl compounds | 458 | 193 | 42% |
| 300 perfluoroalkyl phosphate compounds | 16 | 7 | 44% |
| 400 fluorotelomer-related compounds | 1392 | 350 | 25% |
| 500 per- and polyfluoroalkyl ether-based compounds | 322 | 52 | 16% |
| 600 other PFAA precursors or related – perfluoroalkyl | 282 | 129 | 46% |
| 700 other PFAA precursors or related – semifluorinated | 716 | 83 | 12% |
| 800 fluoropolymersᵃ | 1 | 0 | 0% |

ᵃ Neither mapping captures polymers, due to use of InChIKeys. PFAA = perfluoroalkyl acids.
Table 6 shows that PFAS in the categories 200, 300 and 600 are found quite well in the CORE documents (approx. 40% coverage). In contrast, categories 500 (per- and polyfluoroalkyl ether-based compounds) and 700 (semifluorinated perfluoroalkyl acid (PFAA) precursors), are underrepresented (16 and 12%, respectively). Even within categories, different subcategories were underrepresented, for instance very few entries were found from subcategory 103 “other perfluoroalkyl carbonyl-based nonpolymers” (only 13 of 168 entries in OECDPFAS, i.e. 8%). Likewise, only 3 of 127 (2%) of subcategory 701.2 “Semi-fluorinated alkanes (SFAs) and derivatives (n ≥ 4)” were found, and only 26 of 405 (6%) of 705 “side-chain fluorinated aromatics”. It would be interesting future work to investigate whether the CORE and patent datasets could capture additional knowledge to add more PFAS to these categories, for instance by expanding the “splitPFAS” work at categorizing PFAS58 (prototyped so far on only 4 of the OECDPFAS categories) for this context.
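The coverage ratios quoted above follow directly from the counts in Table 6. A short sketch that recomputes them; the dictionary layout and names are illustrative, while the counts are copied from the table and the text:

```python
# category -> (entries in OECDPFAS, entries also found in CORE), from Table 6
table6 = {
    "100": (490, 126),
    "200": (458, 193),
    "300": (16, 7),
    "400": (1392, 350),
    "500": (322, 52),
    "600": (282, 129),
    "700": (716, 83),
    "800": (1, 0),
}


def coverage_percent(total: int, found: int) -> int:
    """Share of list entries also found in CORE, in whole percent."""
    return round(100 * found / total)


coverage = {cat: coverage_percent(t, f) for cat, (t, f) in table6.items()}

# The category rows add up to the first row of Table 6 (3677 and 940).
grand_total = sum(t for t, _ in table6.values())
grand_found = sum(f for _, f in table6.values())
overall = coverage_percent(grand_total, grand_found)

# Subcategories mentioned in the text: 103, 701.2 and 705
subcategories = {"103": (168, 13), "701.2": (127, 3), "705": (405, 26)}
sub_coverage = {k: coverage_percent(t, f) for k, (t, f) in subcategories.items()}
```

The same function could be applied to the patent dataset once its category mapping is available.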
The resulting PFAS lists have been compared with two of the largest publicly available lists of PFAS molecules, PFASMASTER from the US EPA and the OECDPFAS list, released by the OECD. The overlap between the lists and the data extracted from scientific documents and patents is lower than expected, showing that many molecules on these lists are not found in the scientific documents and patents investigated, while also many molecules from the document extraction are not found in the published PFAS lists. Several thousand were also not in PubChem, but have since been deposited. The CORE and Patents datasets have been provided as CSV files on Zenodo25 for mass spectral screening. This information will add to the number of known potential PFAS substances and hopefully help contribute to alleviating the “PFAS knowledge gap”. The provision of public datasets will allow the integration of this information into various non-target mass spectrometry workflows, such as the open workflows MetFrag53 and patRoon,54 thus enabling other researchers to investigate the potential occurrence of the identified PFAS compounds in humans and the environment in future studies.
Footnote
† Details on the supporting information are summarised in the Data availability section.
This journal is © The Royal Society of Chemistry 2022