Environmental Science Processes & Impacts

Per- and polyfluoroalkyl substances (PFASs) are a large and diverse class of chemicals of great interest due to their wide commercial applicability, as well as increasing public concern regarding their adverse impacts. A common terminology for PFASs was recommended in 2011, including broad categorization and detailed naming for many PFASs with rather simple molecular structures. Recent advancements in chemical analysis have enabled identification of a wide variety of PFASs that are not covered by this common terminology. The resulting inconsistency in categorizing and naming of PFASs is preventing efficient assimilation of reported information. This article explores how a combination of expert knowledge and cheminformatics approaches could help address this challenge in a systematic manner. First, the "splitPFAS" approach was developed to systematically subdivide PFASs (for eventual categorization) following a CnF2n+1-X-R pattern into their various parts, with a particular focus on 4 PFAS categories where X is CO, SO2, CH2 and CH2CH2. Then, the open, ontology-based "ClassyFire" approach was tested for potential applicability to categorizing and naming PFASs using five scenarios of original and simplified structures based on the "splitPFAS" output. This workflow was applied to a set of 770 PFASs from the latest OECD PFAS list. While splitPFAS categorized PFASs as intended, the ClassyFire results were mixed. These results reveal that open cheminformatics approaches have the potential to assist in categorizing PFASs in a consistent manner, while much development is needed for future systematic naming of PFASs. The "splitPFAS" tool and related code are publicly available, and include options to extend this proof-of-concept to encompass further PFASs in the future.


Introduction
Per-and polyuoroalkyl substances (PFASs), as currently dened under the OECD/UNEP Global PFC Group, are organic chemicals containing at least one peruorinated carbon moiety, 1 i.e., -CF 2 -. PFASs may exhibit a number of desirable chemical properties, such as high resistance to heat and chemical reactions, as well as hydrophobicity and oleophobicity, in comparison to their non-uorinated analogues. 2 Therefore, since the 1940s, large numbers of PFASs with diverse functional groups and properties have been developed and used widely in numerous industrial and consumer applications. [2][3][4][5][6][7] Since the late 1990s, there has been mounting scientic evidence of the human health risks of many PFASs, 8 mirrored with mounting concern in policymakers and the general public. In particular, peruorooctanesulfonic acid (PFOS), its salts and peruorooctane sulfonyl uoride, and peruorooctanoic acid (PFOA), its salts and PFOA-related compounds, were listed under the UN Stockholm Convention on Persistent Organic Pollutant in 2009 and 2019 for a global phase-out, respectively, and peruorohexanesulfonic acid (PFHxS), its salts and PFHxSrelated compounds are being evaluated for listing under the Stockholm Convention.
To date, most studies on the occurrence and effects of PFASs have focused on a limited set of PFASs, namely perfluoroalkyl acids (PFAAs), and several PFAA precursors derived from perfluoroalkane sulfonyl fluorides (i.e., PASF-based compounds) as well as perfluoroalkyl iodides (i.e., n:2 fluorotelomer-based compounds, n:2 FTs), 8 see Fig. 1. For the latter, the most commonly studied compounds are those with relatively simple molecular structures, e.g. perfluoroalkane sulfonamides/amidoethanols (FASAs/FASEs) and fluorotelomer alcohols/sulfonic acids (FTOHs/FTSAs). 8 Fig. 1 provides an overview of these major PFAS groups and either generic composition information or specific examples. Additionally, several lists with specific and generic structures are already available online (see ref. [9][10][11][12][13][14]). The research focus on PFAAs and PFAA precursors with simple molecular structures has two main reasons: (1) analytically, they are easier to measure than other PFASs with more complex molecular structures; and (2) analytical standards are generally commercially available. It has been challenging to expand beyond this domain, as the chemical composition (let alone analytical reference standards) of most remaining commercial products is not known in the public domain. However, with the increasing accessibility of high resolution mass spectrometry and advancement of non-target screening techniques, as well as increasing exchange of chemical information between authorities and scientists, these factors are becoming less of a barrier for identifying overlooked and unknown PFASs, which can include unreacted reactant residuals and degradation intermediates present in products and in the environment. This has been repeatedly observed in the many recent "non-target" studies on PFAS-containing aqueous fire-fighting foams and their contaminated sites, [15][16][17] as well as recent reports outlining the extent of PFASs (and other chemicals) in higher order food chain animals such as polar bears 18 and near manufacturing plants. [19][20][21][22] The number of "non-target" studies on PFASs has greatly increased in the past several years and has been reviewed recently. 15
Fig. 1 An overview of PFASs (adapted from the OECD report 1 with the addition of perfluoroalkanoyl fluorides (PACFs), n:1 fluorotelomer alcohols and their derivatives, highlighted in light blue). Interactive lists with structures and/or generic representations are available online. 9,10
Due to the diverse and often complex molecular structures of different PFASs, it may often be challenging to categorize newly identified PFASs in a consistent and coherent manner, particularly for non-technical experts and those who are not familiar with PFASs. The >4700 Chemical Abstracts Service Registry Numbers (CAS_RN) identified in the OECD PFAS list were manually assigned by the same person to certain structure categories. However, such manual categorization efforts cannot be easily reproduced by others due to the high level of expertise required, possible different interpretations of structural traits, and the potential for human errors including oversights and typing errors.
Furthermore, the current development of PFAS terminologies lags behind the rapid development and application of "non-target" screening techniques, particularly for PFASs without a given CAS_RN. As such, the authors of individual studies have often created their own naming conventions (including acronyms) for newly identified PFASs. This has produced many parallel and often non-intuitive acronyms, which can hinder communication among scientists and with other stakeholders and create barriers to synthesizing knowledge. For instance, "1,1,2,2-tetrahydroperfluorodecanol", "2-(perfluorooctyl)ethanol", "8:2 FTOH", "8:2 fluorotelomer alcohol", and "PFA 8" are a few of >36 synonyms registered for one single structure (CAS_RN 678-39-7 (ref. 16)). This is not an issue for PFAS studies alone, but is exacerbated for these substances due to high public and scientific interest, as well as the increasing advancement and application of "non-target" studies. 15 Some non-target studies 17,18 are now using the information included in publicly available suspect lists, via e.g. the NORMAN Suspect List Exchange 10 and the CompTox Chemicals Dashboard, 19,20 in their identification efforts. In addition, several groups are investing efforts into naming and categorisation of PFASs. For instance, the US EPA are experimenting with the incorporation of expert knowledge and cheminformatics approaches developed in house, 21 recently offering some perspectives on how to name certain groups of PFASs, while Barzen-Hansen et al. 22 used a simplified, manual IUPAC-based naming system for the PFASs that they identified in their non-target screening, detailed in the ESI of that publication (pages S6-S7; Table S3 pages S15-S21). † Recently, an open access approach, ClassyFire, 23 was developed to categorize chemicals systematically into a formal chemical ontology.
ClassyFire uses chemical structures and structural features to automatically assign chemicals to a pre-defined taxonomy consisting of up to 11 levels (termed kingdom, superclass, class, subclass, etc.). ClassyFire has been used to annotate over 77 million compounds, 23 and the results can be looked up with InChIKeys (the hashed version of the full International Chemical Identifier, InChI). 24 Only a few very well-known PFASs were in the dataset used to train ClassyFire, primarily those entries that are in DrugBank 25 or T3DB. 26 However, new calculations can be performed using structural information provided as Simplified Molecular Input Line Entry System (SMILES), 27 InChIs or even the International Union of Pure and Applied Chemistry (IUPAC) name. Results and calculations are available via a freely accessible web server 28 at http://classyfire.wishartlab.com.
This background motivates the current study to investigate possible additional automated, open approaches that combine background (expert) knowledge, existing PFAS naming conventions, and cheminformatics to systematically categorize PFASs, particularly in a non-target screening context. In brief, this study consists of two main components: (1) development and testing of a structure manipulation tool, splitPFAS, using simple SMILES 27 and the related SMiles ARbitrary Target Specification (SMARTS) 29 annotations (explained below) to identify PFASs based on pre-defined structural traits; and (2) investigation of the potential to use the combination of splitPFAS and the ontology-based ClassyFire. 23 More specifically, this study focuses on four groups of PFAA precursors as test subjects: PACF- and PASF-based compounds, as well as n:1 and n:2 fluorotelomer-based compounds (see Fig. 1), using discrete structures present in the recent OECD PFAS list. 1 This is because a common terminology for some PFASs in these four groups has been recommended in Buck et al. 4 and thus can be used as a reference point to validate the approach. While there are also many other groups of PFASs of interest, e.g. perfluoroether-based substances, 1 these were not considered as part of this study, as no commonly used basic rules exist yet for characterizing, categorizing and naming these structures. As there is an ongoing international effort under the leadership of the OECD/UNEP Global PFC Group to establish some harmonized basic rules for these groups of PFASs, 30 it is the intention that the approach presented here can be expanded to cover these cases once this additional information is available in the near future.

Methods
This work consisted of three major cheminformatics steps (see Fig. 2), described in detail below. First, the "splitPFAS" method was developed and used to identify whether a given PFAS was within the four PFAS categories of interest in this study. Second, the structures of the PFASs matching these patterns were manipulated according to defined rules and scenarios. The resulting modified structures were used as input for ClassyFire in the third step. The ClassyFire results were then compared with the common terminology recommended by Buck et al., 4 discussed in the Results section.
*As known commercial n:1 fluorotelomer-based compounds are not derived from the telomerization process, but rather from the reduction of perfluoroalkyl carboxylic acids, 3 they are not, strictly speaking, fluorotelomers. Despite this, they are termed "n:1 FT-based compounds" here for readability, since the pattern of the perfluorocarbon:hydrocarbon chain is the same (i.e., n:1 vs. n:2).
These groups display systematic patterns. The PACF derivatives can be represented with the generic formula CnF2n+1-CO-R, the PASF derivatives as CnF2n+1-SO2-R, and the n:1/n:2 FTs as CnF2n+1-CH2-R/CnF2n+1-CH2CH2-R. Some example PASFs (top row, (a)-(c)) and FTs (bottom row, (d)-(f)) are given in Fig. 3 below, with the "R" group highlighted in green. The corresponding names, CAS_RN, and SMILES (Simplified Molecular Input Line Entry System) code 30 of the R group, shown in blue as "R SMILES", are given in the caption.

SMILES and SMARTS-based manipulations with splitPFAS
The systematic patterns, visible from the structures shown in Fig. 3 and the generic formulas given above, render these substances suitable for basic cheminformatics manipulations based on SMARTS, 29 an extension of SMILES able to specify substructures via e.g. wildcard atoms and logical operators (see the reference for further details). In fact, the green highlighting in Fig. 3 is performed using SMARTS functionality in the chemical drawing software used here, CDK Depict. 31,32 These systematic patterns mean that the molecule can be split into two parts, the perfluoroalkyl part (CnF2n+1) and the R group, using a SMARTS-based recognition of the alpha carbon on the PFAS chain and the "dividing group" (termed "X" in this manuscript). In other words, using the test subjects CnF2n+1-CO-R, CnF2n+1-SO2-R, CnF2n+1-CH2-R, and CnF2n+1-CH2CH2-R as examples, all substances satisfy the pattern CnF2n+1-X-R where X is CO, SO2, CH2 or CH2CH2. Using this information, it is possible to come up with some simple SMARTS codes to catch these cases:
As SMARTS can be inherently tricky for users not intimately acquainted with SMILES, let alone SMARTS notation, "splitPFAS", a program written in Java using the Chemistry Development Kit (CDK), 32 was created to implement this SMARTS-based pattern search with a simple input file that requires only the SMILES/SMARTS of the dividing group "X", along with several options controlling the output. The SMARTS codes above can be interpreted as follows: (=O) refers to a double-bonded oxygen, and [CH2] specifies a carbon with exactly 2 hydrogens attached. FC(F)([C,F]) specifies a CF2 attached to either another F or C, i.e., this detects the "alpha" carbon of the perfluorinated chain, while [!$(C(F)(F)); !$(F)] means that X (the SMARTS code in bold above) is not adjacent to a CF2 group or an F and thus identifies the R part of CnF2n+1-X-R.
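The idea of locating the perfluoroalkyl chain, then X, then R can be illustrated at the text level. The sketch below is not splitPFAS, which matches SMARTS on the molecular graph via the CDK. It is a simplified stand-in that assumes a linear SMILES written chain-first as FC(F)(F)C(F)(F)...-X-R, with X supplied as plain SMILES text. The function name `split_linear_pfas` and the regular expressions are hypothetical and were chosen only for this illustration.

```python
import re

# Assumption: the chain is written as a CF3 head plus zero or more CF2 units.
CHAIN = re.compile(r"FC\(F\)\(F\)((?:C\(F\)\(F\))*)")
CF2 = "C(F)(F)"
# R must begin with a new atom (organic-subset symbol or bracket atom).
ATOM_START = re.compile(r"[A-Z\[cnos]")


def split_linear_pfas(smiles, x):
    """Return (n, x, r) if the string reads CnF2n+1-X-R, otherwise None."""
    chain = CHAIN.match(smiles)
    if chain is None:
        return None
    n = 1 + len(chain.group(1)) // len(CF2)  # CF3 head + number of CF2 units
    rest = smiles[chain.end():]
    if not rest.startswith(x):
        return None
    r = rest[len(x):]
    # Mirror the intent of the R-side check: R exists, starts with an atom,
    # and that atom is neither F nor the carbon of another CF2 group.
    if not ATOM_START.match(r) or r.startswith("F") or r.startswith(CF2):
        return None
    return n, x, r


# A C4 sulfonamide written chain-first (illustrative input).
print(split_linear_pfas("FC(F)(F)C(F)(F)C(F)(F)C(F)(F)S(=O)(=O)N", "S(=O)(=O)"))
# A sulfonyl fluoride: R would be a lone F, so no split is reported.
print(split_linear_pfas("FC(F)(F)C(F)(F)S(=O)(=O)F", "S(=O)(=O)"))
```

A string-based check like this breaks as soon as the same molecule is written in a different atom order, which is one reason a graph-based SMARTS match is the more robust choice in practice.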
The SMARTS detecting the PFAS alpha carbon (both parts of the non-bolded SMARTS code above) can be adjusted by advanced users via the optional input "pacs" (PFAS alpha carbon SMARTS). The "splitPFAS" approach was integrated into the "MetFragTools" suite (current version 2.4.5 (ref. 33)), with source code and documentation available on GitHub. 34 Accompanying R scripts and functions are documented and available for use via the RChemMass package in GitHub 35 and as part of the ESI, † along with user instructions on how to use splitPFAS. The SMARTS implemented by default in the current version were designed to handle the case studies in the proof-of-concept approach described here, i.e., focusing on saturated, linear isomers of the perfluoroalkyl part (CnF2n+1). Other forms of the perfluoroalkyl part (e.g., unsaturated and/or branched or cyclic isomers) can be captured (e.g., in future studies) by adjusting the SMARTS with the "pacs" option described above.
The order of the SMARTS in the splitPFAS input file (example available online) 36 is important, as it determines the order in which the patterns are tried on each PFAS in the list. For instance, the order used here is: [CH2] such that first the pattern for PACF derivatives is searched, then PASF derivatives, then n:2 fluorotelomers, then n:1 fluorotelomers. Should the pattern be found, the molecule is "split" into the respective parts (PFAS-part, X and R); otherwise the next pattern is attempted, and so on. If the file includes an empty line at the end, molecules that fulfil the CnF2n+1-R pattern are also split (i.e., the case where there is no dividing group "X"). The output of splitPFAS includes the SMILES of "X", "CnF2n+1-X" and the R group (separated by "|" if more than one), as well as the number of PFAS parts and an error message if the splitting failed. Further documentation of splitPFAS is available in the ESI † and from the GitHub site, 34 while more examples and details are given in the results section below (Section 3.1).
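Why the order matters can be seen with a toy version of the pattern loop. The sketch assumes that the perfluoroalkyl chain has already been stripped and that each X is written as plain SMILES text ("CC" for CH2CH2, "C" for CH2). The helper `first_match` and the category labels are hypothetical; splitPFAS itself works on SMARTS rather than on strings.

```python
# Patterns in the order described in the text: PACF, PASF, n:2 FT, n:1 FT.
ORDERED_X = [
    ("C(=O)", "PACF derivative"),
    ("S(=O)(=O)", "PASF derivative"),
    ("CC", "n:2 FT"),
    ("C", "n:1 FT"),
]


def first_match(remainder, patterns):
    """Return (label, x, r) for the first pattern that fits, else None."""
    for x, label in patterns:
        r = remainder[len(x):]
        # R must start with a new atom: an uppercase symbol or a bracket atom.
        if remainder.startswith(x) and r and (r[0].isupper() or r[0] == "["):
            return label, x, r
    return None


tail = "CCO"  # what follows the chain in an n:2 fluorotelomer alcohol
print(first_match(tail, ORDERED_X))   # CH2CH2 is tried before CH2

# Same patterns with the last two swapped, so CH2 is tried first.
swapped = ORDERED_X[:2] + [ORDERED_X[3], ORDERED_X[2]]
print(first_match(tail, swapped))     # the second CH2 ends up inside R
```

With the swapped order, the n:2 compound is reported as an n:1 FT whose R group has absorbed a CH2, which is the misassignment the prescribed order avoids.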

Calculation with ClassyFire using different scenarios
As mentioned in the introduction, ClassyFire uses chemical structures and structural features to automatically categorize chemicals into a specially designed ontology. Pre-calculated results, as well as results from new calculations, can be accessed via a freely accessible web server at http://classyfire.wishartlab.com. In this study, the web server was accessed using InChIKeys (to retrieve pre-calculated results) and SMILES (for new calculations) from the OECD PFAS list via R. The script is available in the ESI. The ClassyFire workflow contains four steps: (1) preprocessing of the chemical entity; (2) feature extraction; (3) rule-based category assignment and category reduction; and (4) selection of the direct parent. 32 Briefly, the categorization starts with the calculation of the physico-chemical (e.g. mass and pKa) and structural properties (e.g. number of aromatic or aliphatic rings) of the query compound. Then, a list of structural features is generated based on a combination of property calculations and superstructure search, which is performed on a built-in library of over 9000 manually designed SMARTS patterns and Markush structures. 23 Each feature in the list is then assigned to a category in the taxonomy according to a manually compiled dictionary, which contains the weighting and category of each feature. After that, a non-redundant list of chemical categories is constructed and the category of the largest structural feature that describes the compound is selected as the direct parent. However, when the largest structural feature is less informative in describing the compound, the category of the most descriptive feature is defined as the direct parent. Such cases are handled by a manually compiled set of exceptions in ClassyFire. In ClassyFire, the taxonomy categories are defined by unambiguous, computable structural rules, and are named using a consensus-based nomenclature.
In this study, four outputs from ClassyFire (superclass, class, subclass and direct parent) were evaluated for their potential to be used in systematic categorization and naming of PFASs by comparing with the common terminology recommended by Buck et al. 4 To explore how different structures may influence the ClassyFire results, especially as ClassyFire was not developed with PFASs in mind, the PFASs of interest (i.e., PACF derivatives, PASF derivatives, and n:1/n:2 FTs) were manipulated using splitPFAS into five scenarios. To start, the SMILES of the structure CnF2n+1-X-R was split into the fluorinated (CnF2n+1), dividing group (X), and non-fluorinated functional group (R) parts using splitPFAS. These were then used in various combinations, with each scenario documented below in terms of the pattern CnF2n+1-X-R. The SMILES codes of the structures resulting from the following scenarios were then taken as inputs for ClassyFire. The scenarios were:
(i) CnF2n+1-X-R: the structure was not modified;
(ii) CnH2n+1-X-R: the structure was converted into a non-fluorinated analogue (i.e., replacing F with H in the PFAS part);
(iii) H3C-X-R: the fluorinated part was discarded and a methyl added to X, which was re-combined with R to form H3C-X-R and thus compensated for the missing PFAS chain;
(iv) X-R: as in scenario (iii), but only the SMILES of X-R;
(v) R: as in scenario (iv), but only the SMILES of R.
The rationale behind these scenarios is as follows. Scenario (i) formed the base case; ideally this case would yield the desired categorization results, but as ClassyFire was not trained on many PFASs, this was not expected initially in all cases. Scenario (ii) was created to determine whether, instead, ClassyFire could generate sufficiently informative results on the analogous non-fluorinated structure (as alkyl chains are generally far more prevalent than perfluoroalkyl chains).
To remove the inuence of the peruorinated carbon chain on the results entirely, scenario (iv) was conceived. This initially generated many errors that could be resolved by adding a methyl group; this became scenario (iii). An additional concern with scenario (iv), which was easier to implement than scenario (iii), was that the replacement of a (peruoro)alkyl chain with a sole hydrogen (a result of SMILES manipulation) could lead to miscategorization of the functional group (e.g. an ether becomes an alcohol). Since splitPFAS could actually already separate the peruorinated part and the functional group "X", nally scenario (v), containing only the R group, was used as the simplest case to assess the potential of ClassyFire for categorization.
Several examples of scenarios (i) to (iii) are shown in Fig. 4, giving one selected compound for each major case (i.e. PASF, PACF, n:1 FT, n:2 FT). The "X" group is shown in green; thus the X-R and R groups in scenarios (iv) and (v) can be interpreted easily from the column showing scenario (iii). While the splitPFAS method in the Java program can handle structures that result in multiple perfluorinated carbon chains or multiple non-fluorinated parts after splitting (e.g. Fig. 3(c) and (f)), these were not taken into further consideration for ClassyFire at this stage, primarily for simplicity in presenting the results at this proof-of-concept stage, but are discussed further below.

Results from splitPFAS
As mentioned in Section 2.1, the splitPFAS tool was used to split the input SMILES from the selected OECD PFASs following the "CnF2n+1-X-R" pattern according to the given SMARTS of the dividing group "X" (listed above). The output of splitPFAS includes the SMILES of "X", "CnF2n+1-X" and the R group (separated by "|" if more than one), as well as the number of PFAS parts and an error message if the splitting failed. Several examples are given in Table 1; the complete results are in the ESI. † As mentioned above, the order of SMARTS in the file [CH2]", respectively) match the pattern "CnF2n+1-X-R" and were further used as inputs in ClassyFire. The others that were correctly split using splitPFAS (73 compounds) had either two or more "CnF2n+1" or "R" groups and were not used as inputs in ClassyFire, primarily for simplicity at this proof-of-concept phase. As mentioned above, splitPFAS was run with the SMARTS [CH2][CH2] (for n:2 FTs) before [CH2] (for n:1 FTs) to ensure that these cases were treated correctly. The remaining 149 compounds were not correctly split using splitPFAS because their molecular structures were outside the patterns pre-defined in the current version of splitPFAS, including: (1) the perfluoroalkyl chain was branched or cyclic (10 compounds), (2) the perfluoroalkyl chain was unsaturated (7 compounds), (3) the fluoroalkyl chain was not perfluorinated (23 compounds), (4) the R group was a single F atom (15 compounds), (5) the dividing groups (X) were outside the SMARTS notation used in splitPFAS (90 compounds, see Section 2.2), and (6) a combination of the factors above (4 compounds). Details on these cases (and possible extensions to resolve them in future studies) are discussed further in Section 4 below.
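The compound counts reported here can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers given in this article (770 PFASs in total, 73 multi-part splits and the six failure reasons); the variable names and dictionary keys are illustrative shorthand.

```python
total = 770       # PFASs taken from the OECD PFAS list
multi_part = 73   # split correctly, but with more than one chain or R group
not_split = {
    "branched or cyclic chain": 10,
    "unsaturated chain": 7,
    "chain not perfluorinated": 23,
    "R is a single F atom": 15,
    "X outside the SMARTS used": 90,
    "combination of factors": 4,
}

failed = sum(not_split.values())
single_pattern = total - multi_part - failed  # passed on to ClassyFire
print(failed, single_pattern)
```

The six failure reasons add up to the 149 compounds stated above, and the remainder of 548 agrees with the number of compounds evaluated in the ClassyFire results.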
In addition, the splitPFAS results were compared with the manually curated structure codes given in the latest OECD PFAS list. 1,13 In total, eleven compounds were identified as being mislabeled in this list (one PACF was an n:1 FT, two PASFs were in fact n:2 FTs, one n:2 FT was a PASF, and eight n:1 FTs were rather perfluoroalkene derivatives). These entries (a list is provided in the ESI †) will be communicated back to the OECD/UNEP Global PFC Group for possible revisions in the next OECD PFAS list. This demonstrates that splitPFAS has the potential to assist in categorizing PFASs automatically and in detecting human error, thus supporting experts in this work, which is becoming more challenging with the thousands of PFAS structures now being documented. As this OECD PFAS list was the basis for this investigation, and as CAS_RN and name are the primary identifiers in this list, we refer to specific examples throughout this manuscript using the CAS_RN from this list for clarity and to allow a more compact presentation of the results below.

Results from ClassyFire
Overall, ClassyFire returned results in the vast majority of cases. Out of the 548 compounds (50 PFACs, 156 PFSCs, 142 n:1 FTs and 200 n:2 FTs), ClassyFire failed to return results in only two cases, both scenario (i) for n:2 FTs. These cases, CAS_RN 26650-09-9 and 26650-10-2, consistently returned server errors in ClassyFire (e.g. query IDs 3540761 and 3541037), and it is likely that ClassyFire cannot process them properly (both are thiocyanic acids). These cases have been reported to the developers.
The ClassyFire results for scenario (i) vary considerably across different compounds (see Tables 2-4), with a few exceptions where ClassyFire has been fine-tuned to recognize certain PFASs (e.g., see the "direct parent names" of row 5 in Table 2, and rows 1-7 and 9 in Table 4). This suggests that the current version of ClassyFire alone is not suitable for systematic categorization of PFASs, but does have the potential to be adjusted to do so.
Considering the ClassyFire results across PFASs and the respective scenarios, the potential of using ClassyFire as a basis for PFAS naming is elaborated further below in terms of two groups: (1) n:1 and n:2 fluorotelomer-based compounds, and (2) PACF and PASF derivatives.

Simple n:1/n:2 FT compounds. For relatively simple n:1 and n:2 FT-based compounds, ClassyFire provides similar and meaningful results (for a given compound, per scenario) for almost all five scenarios, which could potentially be directly used as a basis for naming these PFASs. Several examples are given in Table 2. In a few cases, the output was too general to be useful in scenarios (i) and (ii), indicated with red shading in Table 2. Taking CAS_RN 375-01-9 (first row, Table 2) as an example, if results from splitPFAS (i.e., "n:1 fluorotelomer") and ClassyFire (e.g. scenario (iii), sub-class name: "alcohols and polyols"; direct parent name: "primary alcohols") are combined manually, this would yield "n:1 fluorotelomer alcohols", which is in line with the terminology recommended by Buck et al. 4 This applies to all other cases listed in Table 2, although the result is not quite sufficiently precise for CAS_RN 19430-93-4 (row 7, highlighted red). Additionally, in some cases, ClassyFire has been fine-tuned to recognize certain fluorotelomers, such as in scenario (i), CAS_RN 755-40-8 (Table 2, row 5), where ClassyFire directly assigned the direct parent name as "fluorotelomer alcohol".
Complex n:1/n:2 FT compounds. Several examples of ClassyFire results for more complex n:1/n:2 FTs are given in Table 3. In contrast to the above, the ClassyFire results would not be a suitable basis for naming these more complex PFASs directly, as the ClassyFire results only provided information on a part of the functional group R. For example, taking CAS_RN 48077-95-8 (Table 3, row 3), the ClassyFire results (sub-class name: "acrylic acids and derivatives"; direct parent name: "acrylic acid esters") capture only the -O-C(O)-CH=CH2 moiety, but not the -N(CH3)CH2CH2- moiety. Therefore, for these cases it seems key pieces of information are missing in the ClassyFire results that would be necessary to name the PFASs correctly. While other parts of the ClassyFire output (other than sub-class name and direct parent name) were also considered, the general pattern described here holds over all output types.
PACF and PASF derivatives. In contrast to n:1 and n:2 FTs, the ClassyFire results for PACF and PASF derivatives vary considerably among scenarios (see Table 4). In general, scenarios (iv) and (v) generated many non-meaningful results, particularly in the case of acids (e.g. CAS_RN 375-85-9, PFHpA, scenario (v), sub-class name: none; direct parent name: "homogeneous other non-metal compounds") and amides (e.g. CAS_RN 423-54-1, scenario (v), sub-class name: none; direct parent name: "homogeneous other non-metal compounds"). Among the other three scenarios, in scenario (i) again it is evident that ClassyFire has been fine-tuned in some cases (e.g. by assigning direct parent name "perfluoroalkyl carboxylic/sulfonic acids and derivatives" to the compounds in the first seven rows of Table 4). While these assignments are correct, they are too general for the naming of these compounds, and this can in fact already be achieved with splitPFAS alone. Therefore, scenario (i) is not further recommended for these substances. Scenarios (ii) and (iii) yielded the same results in many cases, with few exceptions. Similarly to n:1 and n:2 FTs, when the molecular structures of the PACF/PASF derivatives are rather simple, the splitPFAS and ClassyFire results could potentially be combined to provide a good basis for naming the compounds. Using CAS_RN 30334-69-1 as an example, combining the splitPFAS ("perfluoroalkane sulfonyl") and ClassyFire (direct parent name: "organosulfonamides") results would give "perfluoroalkane sulfonamides", which is in line with the recommendation by Buck et al. 4 In contrast, for more complex structures, the ClassyFire results again only reflect part of the functional group, R (e.g. CAS_RN 34454-97-2, direct parent name: "organosulfonamides") and thus do not contain all the information necessary for naming the PFASs properly.

Combining splitPFAS and ClassyFire
In summary, splitPFAS worked as designed, and could successfully distinguish different predefined patterns of PFASs and thus be used to categorize and identify PFASs of interest. The cases that were not considered in this manuscript are discussed in more detail below. In contrast, the ClassyFire results were more mixed. Among the five scenarios examined for PFASs, scenario (iii) appears to be most reasonable for future use. For PFASs with simple molecular structures, it seems that ClassyFire results, when combined with splitPFAS results, could potentially be a good basis for systematically naming PFASs, whereas for more complex structures, the ClassyFire results are not yet sufficient for such purpose and more extensive training or development of ClassyFire may be needed for PFASs. In the following section, these results are assessed and discussed in more detail to propose possible strategies and next steps to further improve this concept.

Overall
The results presented above indicate a few general trends, which will be discussed here with the perspective of scaling this up to future categorization/naming efforts of a greater range of PFASs. In general, splitPFAS is able to identify pre-defined PFAS patterns as designed and thus holds the potential for automated categorisation of PFASs. ClassyFire yielded interpretable results for n:1/n:2 FTs with rather simple functional groups, although the categories were sometimes a little broad, while for the more complex functional groups, the classification seems to correspond with only part of the functional group. ClassyFire also generally yielded interpretable results for the PASF/PACF-based derivatives, but for these cases the direct classification (scenario (i)) was less useful, since the splitPFAS output already takes care of the pattern that required classification. In processing the ClassyFire results, several examples appeared where compound-specific rules seem to be incorporated into ClassyFire, for instance the n:2 fluorotelomer alcohols (e.g. row 5, Table 2) and Table 4, row 1. For the latter, the sub-class name "alkyl fluorides" does not make much sense in the context of the structure, but the direct parent name (perfluoroalkyl carboxylic acid and derivatives) is very specific.
Table 3 Selected ClassyFire results (sub-class names) for n:1 and n:2 fluorotelomer-based compounds with more complex examples. Orange shading indicates an exception to the rules in splitPFAS. Entries in round brackets are the "direct parent name".
The results demonstrate that a combination of expert knowledge and cheminformatics techniques will be needed to improve the characterization, categorization and naming of PFASs. If the patterns can be represented systematically in a cheminformatics format, this expert knowledge and lists of substances can be combined to form a large training set to generate PFAS-specific rules for ClassyFire, which could then be made accessible to the community and thus available to research groups performing e.g. non-target screening of PFASs. This sharing of expertise will be critical to move the field forwards.
A logical next step to build on this work would be to expand the SMARTS definitions for the dividing group "X" to cover other major PFAS groups (i.e., those not considered in this manuscript) and to adjust the PFAS alpha carbon SMARTS, if necessary, to capture some of the (few) specialised cases that fail to split properly. These cases are discussed in more detail in Section 4.2 below. The results above show that output from splitPFAS is, at this stage, already enough to assist in categorizing PFASs and in curating lists, and would potentially provide the detailed training set needed to generate a specialised set of rules for a highly customized ClassyFire for PFASs. Future work should investigate whether a resulting specialised ClassyFire-based ontology, based on splitPFAS categorization, could be used for automated naming of PFASs; currently the results do not yet appear to capture the detail of the R groups needed to produce sufficiently informative names. As splitPFAS is able to divide PFASs into a variety of different scenarios, it will be possible to investigate several different options in future work, once further SMARTS groups are defined. It is interesting to note, especially with respect to potential future efforts, that scenario (iii) was the most promising input into ClassyFire when scenario (i) failed to yield good results. While scenario (iii) was originally prepared by adjusting splitPFAS outputs in an R script (see ESI †), this scenario has been directly incorporated into splitPFAS for future use.
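To illustrate how a split result could feed categorization, and how a further dividing group could be added later, the following is a minimal sketch rather than part of splitPFAS. It assumes the split has already produced the perfluoroalkyl chain length n and a plain-text label for X; the function name, the label strings and the dictionary layout are all hypothetical choices made for this example.

```python
# Dividing groups X covered in this proof-of-concept, mapped to category labels.
# Keys and label strings are illustrative assumptions, not splitPFAS output.
X_CATEGORIES = {
    "CO": "PACF-based",
    "SO2": "PASF-based",
    "CH2": "n:1 FT",
    "CH2CH2": "n:2 FT",
}


def categorize(n_carbons, x_group, categories=X_CATEGORIES):
    """Return a category label for a CnF2n+1-X-R structure, given n and X."""
    label = categories.get(x_group)
    if label is None:
        # X is not (yet) among the defined dividing groups.
        return "uncategorized"
    if label.startswith("n:"):
        # Fluorotelomers: substitute the chain length, e.g. "n:2 FT" -> "6:2 FT".
        return f"{n_carbons}{label[1:]}"
    return f"{label} (n = {n_carbons})"


# Extending the scheme with a further dividing group (hypothetical label).
extended = dict(X_CATEGORIES)
extended["P(=O)"] = "phosphorus-based"
```

With these assumptions, `categorize(6, "CH2CH2")` gives a 6:2 FT label, while an X group outside the dictionary is reported as uncategorized until a matching entry is added.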

Extending splitPFAS beyond the original scope
This study focused on structures in the various selected groups on the OECD list (i.e., structure code 101 to 109, 201 to 209 and 401 to 410). Six major cases were identified that did not fit the patterns defined currently in the splitPFAS approach, or the approach taken here in general; here we discuss these in more detail with specific examples. Most cases mentioned in Section 3.1 above are shown in Table 5, with an example structure and an explanation. Since these are best viewed side-by-side, we refer the reader to the table for more information on these cases.
For one special case, branched fluorotelomer structures, the SMARTS [CH2][CH] was included in early splitPFAS calculations via the splitPFAS SMARTS input file, to capture these cases and include possible branched and ring FT structures (i.e., where the branching occurs on the FT part, the one or two non-fluorinated carbons). However, this pattern caused incorrect splitting results for some compounds, such as breaking down of ring structures in the "R group" (e.g. CAS_RN 1765-92-0) or yielding more than one "R group" (e.g. CAS_RN 38550-). After removing the [CH2][CH] pattern, those compounds could be correctly split by [CH2]. Therefore, given the complexity of the structure of PFASs, it was decided not to consider this case in this investigation, as these structures do not strictly follow the n:1 or n:2 FT patterns chosen. It is, however, possible to process them with the existing splitPFAS method. Again, the patterns and the order of the patterns should be carefully selected when using splitPFAS in order to achieve optimal splitting results. To avoid such conflicts between patterns, subsets of lists should probably be processed using different SMARTS lists as input for different groups of compounds, i.e., first processing simple cases, then adjusting the splitPFAS inputs to account for more complicated cases and running these only on those entries that fail the simple cases. This is discussed further below.
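The staged processing suggested above (simple patterns first, more complicated patterns only for the entries that fail) can be sketched as follows. This is an illustration under stated assumptions, not the splitPFAS implementation: `split_in_passes`, the `Splitter` interface and `toy_splitter` are hypothetical names, and the toy splitter uses literal substring tests on SMILES as a stand-in for real SMARTS matching, which requires a cheminformatics toolkit.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical splitter interface (not the splitPFAS API): given a SMILES string
# and an ordered pattern list, return the matched pattern or None on failure.
Splitter = Callable[[str, Sequence[str]], Optional[str]]


def split_in_passes(
    smiles_list: Sequence[str],
    pattern_sets: Sequence[Sequence[str]],
    splitter: Splitter,
) -> Tuple[Dict[str, dict], List[str]]:
    """Run each pattern set only on entries that failed all earlier sets."""
    results: Dict[str, dict] = {}
    pending = list(smiles_list)
    for pass_no, patterns in enumerate(pattern_sets, start=1):
        still_failing = []
        for smi in pending:
            matched = splitter(smi, patterns)
            if matched is None:
                still_failing.append(smi)
            else:
                results[smi] = {"pass": pass_no, "pattern": matched}
        pending = still_failing
    return results, pending


def toy_splitter(smiles: str, patterns: Sequence[str]) -> Optional[str]:
    """Toy stand-in: literal substring test; the first pattern in the list wins."""
    for pattern in patterns:
        if pattern in smiles:
            return pattern
    return None


entries = [
    "NC(=O)C(F)(F)C(F)(F)C(F)(F)F",  # amide SMILES quoted in the text
    "OCCC(F)(F)C(F)(F)F",            # a short n:2 fluorotelomer alcohol
    "FC(F)(F)F",                     # matches neither toy pattern set
]
results, unresolved = split_in_passes(
    entries, [["S(=O)(=O)", "C(=O)"], ["CC"]], toy_splitter
)
```

Because the order within each pattern list decides which pattern wins, and later lists never see entries that were already split, a broad pattern placed in a later pass cannot disturb structures that a simpler pattern handles correctly. Entries left in `unresolved` are candidates for manual inspection or further pattern design.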
For a further special case, perfluoroalkene derivatives, no example is shown in Table 5. These examples all failed due to a combination of factors, including the presence of an unsaturated perfluoroalkyl chain and the fact that X did not match the functional groups chosen. However, as these cases do exist in the list, future efforts should consider the possibility of unsaturation in the perfluoroalkyl chain, as well as linear and branched perfluoroalkyl chains, and ring structures. The necessary features to do this are already built into the splitPFAS approach.
In light of the results presented here and all cases in this section, the functionality of the original splitPFAS was extended to allow users to adjust the SMARTS used to identify where to "split" the structures, accessible via the option "pacs" (PFAS Alpha Carbon SMARTS). 33 Care should be taken when trying new SMARTS for the "pacs" and "X" groups to avoid incorrect splitting. Optimal results will most likely be achieved when experts in PFASs and cheminformatics join forces to design SMARTS codes for the various PFAS groups.

Issues caused by tautomeric structures
Cheminformatics approaches also have their limitations, and tautomeric structures are often difficult cases to handle. While it is often easy for a trained chemist to see the equivalence in tautomers due to resonance, this can be very difficult to program into a computer (even the InChI algorithm has several tautomer-related issues). A variety of established cheminformatics toolkits exist; here we have used the CDK, whereas ClassyFire is largely implemented using ChemAxon. 23 While these are generally compatible, differences in structural interpretation can occur, especially for large and challenging structures with several tautomeric forms. Furthermore, choosing to work off the efficient SMILES notation (which is semi-human readable, as done here) rather than more information-rich formats like MOL formats can exacerbate this, as each SMILES has to be interpreted by the toolkit into a richer form for manipulation. Two entries in this work where this appears to have happened are highlighted red in Table 4 (rows 3 and 4) and the suspected tautomerization is shown in Fig. 6. This structure was depicted as drawn on the left on the 4 major open depiction tools displayed in AMBIT 37 (https://apps.ideaconsult.net/ambit2/depict using the SMILES NC(=O)C(F)(F)C(F)(F)C(F)(F)F in the respective field) and is also displayed as such on the CompTox Chemicals Dashboard, which uses ChemAxon for depiction, so it is not clear how the reinterpretation happened in ClassyFire to yield a false classification (carboximidic acid instead of perfluoroacyl amide).
While cases such as these will happen with any automated approach, they are relatively rare and could be captured in the future using a consensus tautomer approach; chemical databases like PubChem 38 and the CompTox Chemicals

The default "pacs" SMARTS in splitPFAS currently searches for C-C or C-F bonds; thus any structure with an atom other than C or F in the fluoroalkyl chain will not fulfil the pattern, as here, where the pattern is H-(CnF2n)-X-R with X = C(=O). Other members followed e.g. a Cl-(CnF2n)-X-R pattern. These can be captured by adjusting the "pacs" option.

The functional group R is F only (375-72-4): These substances likewise failed the SMARTS pattern encoded into splitPFAS, which currently excludes compounds with a generic formula CnF2n+1-X-F. This could be addressed by adjusting the "pacs" option as well in future studies.

355-66-8: These examples were outside the scope defined for this article; examples of the form R1-X-(CnF2n)-X-R2 are split correctly, but result in two PFAS chain results, which we did not consider further here.

73980-71-9: For compounds of the form (CnF2n+1)X-R-X(CmF2m+1), the main issue is how to define C-X-R.

Outlook
In this study, two cheminformatics approaches (i.e., splitPFAS and ClassyFire) were evaluated to explore the potential of using such automated, open tools to enable stakeholders to systematically categorize and name PFASs. In particular, splitPFAS has proven useful for identifying specific PFAS patterns and thus can be helpful in the systematic categorization of PFASs in general. For example, splitPFAS has successfully identified a number of cases where PFASs were assigned incorrect structure codes/categories in the OECD PFAS list. Therefore, one particular future use of splitPFAS can be to assist stakeholders, particularly those who are not familiar with the complex PFAS class, in curating and processing long lists of PFASs with pre-defined structure categories. Regulators and manufacturers may be able to use splitPFAS to process their own inventories and identify certain PFASs of interest (e.g. PFOA-related compounds under the Stockholm Convention). While splitPFAS holds a promising future, it should also be noted that the predefined structure categories used here are still limited, as this study focused on proof-of-concept. In the future, splitPFAS should be developed to encompass a wider range of PFASs by defining further major dividing groups X, e.g. X = "P(=O)". Further work should also be done to capture the cases that are not yet perfectly handled, such as (1) branched and cyclic perfluoroalkyl chains, (2) unsaturated perfluoroalkyl chains, (3) polyfluoroalkyl chains (e.g. H- or Cl-CnF2n-R) and (4) perfluoroalkyl ether chains (e.g. CnF2n+1-O-CmF2m+1). While the rules to be used by splitPFAS in some of these areas are yet to be defined, the functionality is built in and ready to be applied, and it is likely that extensions to the SMARTS used in splitPFAS could provide useful functionality for several different audiences.
In contrast, using ontology-based approaches such as ClassyFire in systematic categorisation and naming of PFASs warrants greater investigation and discussion. The results do not appear sufficiently detailed at this stage to provide enough information for systematic naming. However, a more detailed training set, created using e.g., the splitPFAS approach, may yield sufficient specialized rules in the future to enable this.

Conflicts of interest
There are no conflicts of interest to declare.