Anita Sosnowska,*a Natalia Bulawska,a Dominika Kowalska a and Tomasz Puzyn*ab
a QSAR Lab, ul. Trzy Lipy 3, Gdańsk, Poland. E-mail: a.sosnowska@qsarlab.com; t.puzyn@qsarlab.com
b University of Gdansk, Faculty of Chemistry, Wita Stwosza 63, 80-308 Gdansk, Poland. E-mail: tomasz.puzyn@ug.edu.pl
First published on 10th February 2023
Per- and polyfluoroalkyl substances (PFAS) are widespread in the environment. Properly developed QSAR/QSPR models can be used to assess the impact of these chemicals on humans and the environment. This work assesses 38 in silico models developed for this group of compounds, most of which address physicochemical properties (22), followed by toxicological (8) and ecotoxicological (8) endpoints. The models were evaluated using the (Q)SAR Model Reporting Format (QMRF), which was either found in the QSAR Database (5 models) or prepared manually in the QMRFEditor-v3.0.0 format from the information contained in the scientific publications (33 models). We assessed each of the OECD principles described in the document individually and then combined the results. We identified 22 models as scientifically valid and usable for predicting the properties of new compounds. Twelve of them contained all the information necessary to reproduce the model, and another 10, despite the lack of some information, are still reproducible. The other 16 models do not contain enough information to be reproduced and are therefore considered scientifically invalid. The present work identifies the remaining gaps, needs, and recommendations that should be considered in the further development of predictive models in the PFAS area.
On the one hand, the unique properties of PFAS enable their use in numerous applications; on the other hand, the same properties make PFAS dangerous for the environment and humans. PFAS may affect the immune,4 digestive,5 metabolic,6,7 endocrine,8,9 and nervous systems.10 They contribute to changes in maturation and increase the risk of developing breast, kidney, testis, prostate, and ovary cancer.11 PFAS may also act as endocrine disruptors by influencing, for example, thyroid hormone levels. Studies indicate that PFAS can lower a woman's chances of getting pregnant and affect the growth, learning, and behavior of infants and older children. From the environmental point of view, PFAS are not eliminated by the natural barriers of terrestrial and aquatic ecosystems and can therefore concentrate in water, nutrients, etc.12 Because many PFAS are soluble in water and have a low potential for adsorption onto particles, they are very difficult to remove from the aquatic environment, including drinking water sources, using conventional methods. That is why they are also known as “Forever Chemicals”.
Reports and the literature indicate that PFAS may trigger adverse effects and that they are very persistent, mobile (PM), and difficult to remove from the environment, which has prompted regulatory authorities to take a closer look at this group of chemicals. In recent years, the production and use of several groups of PFAS have been restricted under the REACH Regulation. Two of the best-known and most frequently used PFAS have been included in the Stockholm Convention: perfluorooctane sulfonic acid (PFOS)13 and its derivatives since 2009, and perfluorooctanoic acid (PFOA),14 its salts and PFOA-related compounds since 2020. Several European countries (mainly Norway, Germany, the Netherlands, Denmark, and Sweden) regularly raise initiatives that lead to proposed restrictions on different groups of PFAS.
Based on these initiatives, the European Commission decided that from February 2023 the next group of PFAS, perfluorinated carboxylic acids (C9-14 PFCAs),15 their precursors and salts, will be restricted in the EU/EEA. It seems that such restrictions will soon affect more and more groups of these chemicals, considering that many of them are also on the REACH candidate list of substances of very high concern (SVHC), which classifies them as carcinogens, mutagens, and reprotoxicants (CMRs), and also as persistent, bioaccumulative and toxic/very persistent and very bioaccumulative (PBTs/vPvBs) chemicals. PFAS are a subject of interest of the European Green Deal initiative and the Zero pollution action plan,16 which aims to reduce the levels of pollutants in air, water, and soil and to create a toxic-free environment. Three European Horizon 2020 projects have been funded (PROMISCES (101036449), SCENARIOS (101036756), ZeroPM (101037509)) that deal with this topic and propose new strategies to protect the environment and human health from PFAS and PM chemicals.
Even though PFAS have been produced for more than 80 years, the group is so large (about 5000 compounds) that only a small number of them have been fully tested and have known properties. It is impossible to test every substance experimentally; therefore, ECHA recommended a holistic group approach to regulatory assessment and risk management based on the EU strategy for PFAS.17 In this respect, computational methods for deriving the activities/properties of PFAS can be widely applied to replace or complement experimental methods. Using in silico methods, data analysis, and machine learning, it is possible to determine the toxic potential and physicochemical properties of a large set of compounds based only on a small set of available experimental data.
One of the groups of methods based on the similarity of relationships, and recommended by REACH for supporting the substance registration process, is quantitative structure–activity(property) relationship models (QSAR/QSPR models). QSAR/QSPR models relate a set of descriptors (X) to the response variables (Y). The chemical structure (variable X) is represented numerically through descriptors such as molar mass, number of atoms, number of bonds, number of aromatic rings, hydrophobicity, etc. (Scheme 1).19–22 The choice of the appropriate modeling method depends both on the nature of the modeled quantity and on the nature of the relationship between the descriptors and the predicted value (linear or non-linear). If the modeled variable is quantitative, linear and nonlinear regression techniques can be used for modeling. When the data are qualitative, the selection of modeling methods is limited to classification methods. The credibility of the models is confirmed by appropriate statistical parameters. A properly developed QSAR model should be characterized by a good fit to the training set, robustness, and defined predictive ability.19 A model developed in this way, in addition to providing information about the properties of chemical compounds, should also support the understanding of the mechanisms related to the biological activity of substances.
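The relationship between descriptors (X) and a response (Y) can be illustrated with a minimal sketch of the simplest case: a one-descriptor linear model fitted by ordinary least squares. The data below are invented for illustration and are not taken from any of the reviewed models; the function names `fit_linear_qsar` and `predict` are hypothetical.

```python
from statistics import mean


def fit_linear_qsar(x, y):
    """Ordinary least squares for y = b0 + b1 * x with a single descriptor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def predict(model, x_new):
    """Apply the fitted intercept and slope to a new descriptor value."""
    b0, b1 = model
    return b0 + b1 * x_new


# Hypothetical training set: descriptor = number of carbon atoms,
# response = an invented log-scale property (illustration only).
n_carbons = [4, 6, 8, 10, 12]
log_property = [2.0, 1.0, 0.0, -1.0, -2.0]

model = fit_linear_qsar(n_carbons, log_property)
estimate = predict(model, 9)  # prediction for a compound not in the training set
```

Real QSAR/QSPR models usually combine several descriptors and may use non-linear or classification methods; the single-descriptor case is chosen here only because it keeps the arithmetic transparent.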
In silico methods are an attractive and faster alternative to time-consuming laboratory and clinical research methods. They are also used to support experimental data or to support prioritization in the absence of experimental data for regulatory purposes. Based on the existing registration dossiers, the European Chemicals Agency (ECHA) analyzed the methods used for obtaining information on the properties of substances, which showed that the alternative methods to animal testing specified and recommended in REACH are successfully used by registrants. Annex XI of the REACH regulation allows the application of (Q)SAR models as a standard mode of research.20 However, to use such models for predictions supporting the substance registration process, certain conditions have to be fulfilled. The model used for prediction should follow the golden standards established for QSAR models in 2004.21 In accordance with these rules “to facilitate the consideration of a (Q)SAR model for regulatory purposes, it should be associated with the following information:
1. a defined endpoint;
2. an unambiguous algorithm;
3. a defined domain of applicability;
4. appropriate measures of goodness-of-fit, robustness, and predictivity;
5. a mechanistic interpretation, if possible”.21,22
Only models which fulfill all the above-mentioned requirements can be used for predicting biological activity/physical parameters to support the registration of the substance. What is more, it is recommended that a well-documented description of the applied model should be attached to the registration dossier. For this purpose, QMRF ((Q)SAR Model Reporting Format) was proposed in 2007.22
The report is divided into 10 sections (Scheme 2), which refer to different aspects of QSAR/QSPR models required for regulatory purposes.23 Below we specify what should be indicated in each section of the QMRF report.
Section 1 (QSAR identifier) is related to the description of the model, where the title, other related models, and software coding of the model should be indicated.
Section 2 (general information) provides general information about the developed model, e.g., the date and the authors who prepared the QMRF, the authors of the model and the related publication, the available information on the model (e.g. training and external validation sets, source code, and algorithm), and whether other QMRF documents exist for the same model.
Section 3 (defining the endpoint – OECD principle 1) should specify the endpoint (physicochemical, biological, or environmental effects, chosen from the pre-defined classification),23 the species and units for the endpoint, the experimental protocol followed in collecting the experimental sets, the information on the data quality assessment, and the relationship between the modeled (dependent variable) and measured endpoints (e.g. transformation, etc.).
Section 4 (defining the algorithm – OECD principle 2) refers to the type of the model (e.g. SAR, QSAR, Expert-based Systems, and Neural Networks) and all information connected with the developed relationship. In particular, it should be indicated which descriptors are used in the model, how they were estimated, how the selection of the descriptors was performed, and what is the ratio of the descriptors used in the model to chemicals in the training set. Moreover, the method and software used to derive the relationship (algorithm) should be specified in this section.
Section 5 (defining the applicability domain – OECD principle 3) provides detailed information about the chemical space in which the model predicts properly. Here the comments on methods and software used for defining the applicability domain, and their limitations should be mentioned.
Section 6 (defining goodness-of-fit and robustness – OECD principle 4) contains notes about the statistical analysis that should be performed to establish the performance of the model, including internal validation (i.e., measures of goodness-of-fit and robustness). This section should state the availability of the training set with all specifications, e.g., CAS number, SMILES, Mol file, etc. Moreover, the endpoint values and the descriptor values for the chemicals in the training set should be included here. The authors should indicate whether the model was developed on the raw data or whether any transformation was applied. All statistics describing the goodness-of-fit (r2, r2 adjusted, standard error, sensitivity, specificity, false negatives, false positives, predictive values, etc.)26,27 and robustness (e.g. leave-one-out and leave-many-out cross-validation,28 Y-scrambling,29 bootstrap29 or any other corresponding statistics) should be reported.30
Section 7 (defining predictivity – OECD principle 4) is associated with the external validation of the model and the determination of its predictive power, i.e., a measure of how well the model predicts endpoints for new chemicals that were not used to develop the model. In this section the following information should be provided: the availability of the validation set with all specifications, e.g., CAS number, SMILES, Mol file, etc. The data for each descriptor and the dependent variable for the external validation set, and the information on how the validation set was defined (e.g., randomly, using a specific algorithm, searching in the literature, etc.), need to be presented. Moreover, all statistics obtained by external validation28,31–33 and the predictivity assessment (a discussion of the size of the validation set and of whether it is sufficient and representative of the applicability domain) should be specified.30
Section 8 (providing the mechanistic interpretation – OECD principle 5) refers to the mechanistic interpretation of the presented model. Here, information on the mechanistic basis of the model should be provided. The structural features that are responsible for the modeled properties should be described. Also, if possible, the physicochemical interpretation of the descriptors used should be explained. It should be pointed out whether the mechanistic interpretation was determined a priori (before modeling, with the training set and descriptors fitted to already known statements) or a posteriori (after modeling, as a result of interpreting the obtained relationship).
Section 9 (miscellaneous information) includes any other relevant and useful comments not indicated above, a bibliography (references not strictly associated with the developed model), and the ESI† (if it is attached to the QMRF, the ESI† may include the training and test sets submitted in defined file formats).
Section 10 (summary for the JRC QSAR model database) 24 is a summary section specified for the JRC Database. Here the QMRF number is generated and the publication date, keywords, and comments relevant to the publication of the QMRF in the JRC Database (e.g., updates) should be reported.
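The goodness-of-fit, robustness, and predictivity statistics named in Sections 6 and 7 can be sketched for a one-descriptor linear model as follows. This is an illustrative sketch, not the procedure of any reviewed model: the helper names are hypothetical, the data are invented, and the formulas used are common conventions (r2 from the residual sum of squares, leave-one-out Q2 from PRESS relative to the total sum of squares, and an external Q2 that references the training-set mean).

```python
from statistics import mean


def fit(x, y):
    """Least-squares intercept and slope for one descriptor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((a - x_bar) ** 2 for a in x)
    b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1


def r2(x, y):
    """Goodness-of-fit: 1 - RSS/TSS on the training set."""
    b0, b1 = fit(x, y)
    y_bar = mean(y)
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - rss / tss


def q2_loo(x, y):
    """Robustness: leave-one-out cross-validated Q2 = 1 - PRESS/TSS."""
    y_bar = mean(y)
    press = 0.0
    for i in range(len(x)):
        # Refit without compound i, then predict compound i.
        b0, b1 = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (b0 + b1 * x[i])) ** 2
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - press / tss


def q2_ext(x_train, y_train, x_ext, y_ext):
    """Predictivity: external Q2 referenced to the training-set mean."""
    b0, b1 = fit(x_train, y_train)
    y_bar_train = mean(y_train)
    press = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x_ext, y_ext))
    ss = sum((yi - y_bar_train) ** 2 for yi in y_ext)
    return 1 - press / ss


# Invented data for illustration only.
x_tr = [1, 2, 3, 4, 5, 6]
y_tr = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
x_ex, y_ex = [7, 8], [7.0, 7.9]

fit_quality = r2(x_tr, y_tr)
robustness = q2_loo(x_tr, y_tr)
predictivity = q2_ext(x_tr, y_tr, x_ex, y_ex)
```

For a least-squares model the leave-one-out Q2 cannot exceed r2, so a large gap between the two is a common warning sign of overfitting.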
QSAR models are playing an increasingly important role in defining the properties needed for the hazard and risk assessment of chemicals. Using this method, it is possible to search for compounds that are safe for the environment and humans but still exhibit certain desired properties. New compounds can be registered based on QSAR models documented in the form of a QMRF; only then is the documentation standardized and are the predictions made with these models reliable.34
In light of the considerations above, the present work attempts to summarize the previous QSAR/QSPR studies of PFAS and to verify whether the models developed so far for predicting physicochemical properties and biological activity are scientifically valid and could be easily applied to predict the properties of new (safer) compounds. What is more, this review highlights the remaining gaps in this field and defines further challenges related to applying computational methods for predicting the activity and properties of PFASs.
In this way, we collected 38 models: 22 for predicting physicochemical properties, 8 for toxicological, and 8 models for ecotoxicological endpoints for PFASs (Table 1). Among those 38 models, five have ready-made QMRF documents available in the QSAR Database35 (VP2, S2, LC50(1) – acute inhalation toxicity, LC50(2) – acute inhalation toxicity, and LD50(1) – acute oral toxicity), while one of them (LC50(1)) has also entered the JRC database.24 It is worth mentioning that 5 ready-made QMRF documents (listed above) are also implemented in the QSARINS-Chem software, which is dedicated to the development and validation of QSAR models.36 For the rest of the collected models, the missing QMRFs have been prepared based on the format from QMRFEditor-v3.0.0.25 Ten sections (see the Introduction section) of the QMRF document were completed based on the information provided in the original papers and their supplements. All prepared QMRF documents are presented in ESI 2.† In the next step, all developed QMRFs were evaluated in terms of the availability of information on each of the OECD principle sections (sections 3–8), Table 2. The presence/absence of important information for the correctness of the particular model was assessed using +/−, whereas the presence/absence of additional information (comments) was marked using ✓/✗. Next, the results were analyzed in terms of the suitability of the developed QMRFs for regulatory purposes and the possibility of easily repeating the models and predicting missing values for new compounds. For the comparison between the collected models, we considered only the important information on the QMRF (bold in Table 2), not additional (comments). The OECD principles were evaluated one by one and five color-coded classes were established according to the percentages of the presence of necessary information in QMRFs. 
The thresholds of these classes were as follows: green (100–80%), gray (79–60%), yellow (59–40%), light orange (39–20%), and dark orange (19–0%). In the next step, we compiled the results for all principles and drew conclusions about the suitability of the available models for reproduction and for predicting the properties of new compounds.
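The color-coding step can be sketched as a small function that converts the share of required items present in a QMRF into one of the five classes, using the order of classes as listed in the paragraph above. The function name and the treatment of non-integer percentages (each class starts at its lower bound) are assumptions made for this illustration.

```python
def principle_class(n_present, n_required):
    """Map the share of required QMRF items present to a color class.

    Assumption: each class is closed at its lower bound, so that
    non-integer percentages (e.g. 79.5%) are also assigned a class.
    """
    if n_required <= 0:
        raise ValueError("n_required must be positive")
    pct = 100 * n_present / n_required
    if pct >= 80:
        return "green"
    if pct >= 60:
        return "gray"
    if pct >= 40:
        return "yellow"
    if pct >= 20:
        return "light orange"
    return "dark orange"


# Hypothetical example: 3 of 5 required items marked "+" for one principle.
example = principle_class(3, 5)
```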
Category | Endpoint | Models |
---|---|---|
Physicochemical endpoints | Vapor pressure | VP1,37 VP2,38 VP3,39 VP440 |
 | Water solubility | S1,37 S2,38 S339 |
 | Octanol–water partition | KOW39 |
 | Air–water partition | KAW39 |
 | Octanol–air partition coefficient | KOA39 |
 | Fluid–fluid interfacial adsorption coefficient | Ki41 |
 | Melting point | MP1,42 MP2,42 MP3,42 MP442 |
 | Boiling point | BP1,42 BP2,42 BP3,42 BP442 |
 | Critical micelle concentration | CMC37 |
 | Defluorination factor | DF43 |
 | C–F bond dissociation energy | CFDE44 |
Toxicological endpoints | T4-TTR binding (TTR) | IC50(3)45 |
 | Acute inhalation toxicity (Rattus, Mus musculus) | LC50(1),38 LC50(2),38 LC50(3),46 LC50(4)46 |
 | Acute oral toxicity (Rattus, Mus musculus) | LD50(1),38 LD50(2),47 LD50(3)47 |
Ecotoxicological endpoints | Cytotoxicity (Xenopus tropicalis) | IC50(1)48 |
 | Developmental toxicity (Danio rerio) | IC50(2)49 |
 | Toxic effect on root elongation (Lactuca sativa, Pseudokirchneriella subcapitata) | EC50(1),50 EC50(2)50 |
 | Acute toxicity (Pseudokirchneriella subcapitata, Chlorella vulgaris, Daphnia magna, Danio rerio) | EC50(3),51 EC50(4),51 LC50(5),51 LC50(6)51 |
The presence/absence of information important for the correctness of a particular model was assessed using +/−. The presence/absence of additional information (comments) was marked using ✓/✗; “—” means not required. a Information included in the ready-made QMRF but not available in the publication. Fulfilment of each OECD principle is color-coded: green (100–80%), yellow (79–60%), gray (59–40%), light orange (39–20%), and dark orange (19–0%). SV: the model is scientifically valid. SN: the model is not scientifically valid.
The QMRF format was proposed in 2007, and all models collected in this work were developed after 2007. Surprisingly, however, only 5 of them have QMRF documents available in the QSAR Database35 (models for the following endpoints: VP2, S2, LC50(1) – acute inhalation toxicity, LC50(2) – acute inhalation toxicity, and LD50(1) – acute oral toxicity). In addition, the QMRF document for LC50(1) has also entered the JRC database.24 This alone suggests that, for most of the developed models, the authors did not consider application for regulatory purposes but rather wanted to explain the processes and relationships between the structure of PFASs and their properties. However, for a model to be used to predict the properties of new compounds, a QMRF must be available. Therefore, for the remaining 33 collected models, we completed the QMRFs using the information provided in the publications. Next, we evaluated each QMRF in terms of fulfillment of the OECD principles and verified whether all the information necessary to repeat the model is available in the paper.
Summarizing the evaluation of all available QSAR/QSPR models against the five OECD principles, 6 out of 38 models (VP2, S2, acute inhalation toxicity LC50(1–2), acute oral toxicity LD50(1), and cytotoxicity IC50(1)) are scientifically valid: they contain all the information necessary to reproduce the model and predict endpoint values for new compounds. A further 16 models (VP1, VP4, S1, Ki, MP(1–3), BP(1–3), CMC, acute inhalation toxicity LC50(3–4), acute oral toxicity LD50(2–3), and developmental toxicity IC50(2)) lack some information but are still reproducible. The other 16 models lack details on many more items and therefore should probably not be used to predict these endpoints for new compounds. The presence of a QSAR model built on PFAS mixtures is worth mentioning here. This model for mixtures48 cannot be used for predictions for new single PFASs. However, this paper aimed to review, as far as possible, all QSAR/QSPR models related to PFAS and to evaluate their reproducibility, and therefore we also included it. The model is very important for mixtures, which are increasingly considered in PFAS assessment and are the subject of many ongoing research studies in European projects (e.g. PARC, PROMISCES).
Acronym | CAS number | Carbon chain length | Exp. water solubility,b log (mg L−1), 20 ± 0.5 °C | Pred. water solubility,c log (mg L−1)a |
---|---|---|---|---|
PFUnA | 2058-94-8 | 11 | −0.22 | −2.0282875 |
PFTeDA | 376-06-7 | 14 | −0.53 | −5.678205 |
PFHxDA | 67905-19-5 | 16 | −0.82 | −8.95235 |

a logAqS = −0.418(±1.940) − 0.003(±0.001)T(F..F) + 5.185(±3.849)SIC1. b Inoue et al.70 c Bhhatarai and Gramatica.37 Structure images are not reproduced here.
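The gap between the experimental and predicted water solubility values in the table can be quantified with a few lines of arithmetic. The sketch below only reuses the six numbers from the table; the variable names are chosen for illustration, and no descriptor values are assumed.

```python
from math import sqrt

# (acronym, carbon chain length, experimental log S, predicted log S),
# copied from the water solubility table; units are log (mg L-1).
rows = [
    ("PFUnA", 11, -0.22, -2.0282875),
    ("PFTeDA", 14, -0.53, -5.678205),
    ("PFHxDA", 16, -0.82, -8.95235),
]

# Residual = experimental minus predicted, in log units.
residuals = {name: exp - pred for name, _, exp, pred in rows}

# Root-mean-square error over the three external compounds.
rmse = sqrt(sum(r ** 2 for r in residuals.values()) / len(residuals))

for name, n_c, _, _ in rows:
    print(f"{name} (C{n_c}): residual = {residuals[name]:.2f} log units")
print(f"RMSE = {rmse:.2f} log units")
```

For these three compounds the residual grows with the carbon chain length.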
Acronym | CAS number | Carbon chain length | Exp. vapor pressure, log (Pa) | Pred. vapor pressure, log (Pa)a,d |
---|---|---|---|---|
PFPrA | 422-64-0 | 3 | 3.59b | 3.31 |
PFPeA | 2706-90-3 | 5 | 3.43b | 2.6 |
PFHxA | 307-24-4 | 6 | 1.1c | 2.18 |

a logVP (mmHg) = 7.97 − 0.16 × F03[C–F] − 3.16 × ACC − 0.64 × nDB. b Kwan.71 c Zhang et al.72 d Bhhatarai and Gramatica.37 Structure images are not reproduced here.
Similarly, as in the case of water solubility, the model for predicting vapor pressure (VP1) was applied to three compounds that are not included in the training set (CAS 422-64-0, CAS 2706-90-3, and CAS 307-24-4). Both the experimental data found for the three perfluorocarboxylic acids and the data used by Bhhatarai and Gramatica37 to build the model were obtained at 25 °C. The differences in the log(Pa) values may nevertheless be due to different methods and conditions of data collection. Despite these differences, the vapor pressure shows a descending trend with an increasing number of carbon atoms in the PFAS main chain. The external compounds contain 3–6 carbon atoms, whereas the eight carboxylic acids in the training set of the model contain 2–12 carbon atoms in the main chain; the external compounds therefore fall within the range covered by the training set, which leads to correct prediction.
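The comparison for vapor pressure can be sketched in the same way. Because the model equation in the table footnote is expressed in log(mmHg) while the tabulated values are in log(Pa), the sketch also includes a unit shift using 1 mmHg ≈ 133.322 Pa. Whether the tabulated predictions were converted in exactly this way is an assumption; the helper name is hypothetical.

```python
from math import log10

PA_PER_MMHG = 133.322  # 1 mmHg is approximately 133.322 Pa


def log_mmhg_to_log_pa(log_vp_mmhg):
    """Shift a log10 vapor pressure from the mmHg scale to the Pa scale."""
    return log_vp_mmhg + log10(PA_PER_MMHG)


# (acronym, carbon chain length, experimental log VP, predicted log VP),
# copied from the vapor pressure table; units are log (Pa).
rows = [
    ("PFPrA", 3, 3.59, 3.31),
    ("PFPeA", 5, 3.43, 2.6),
    ("PFHxA", 6, 1.1, 2.18),
]

# Residual = experimental minus predicted, in log units.
residuals = {name: exp - pred for name, _, exp, pred in rows}

# Check the descending trend of the predictions with chain length.
predicted = [pred for _, _, _, pred in rows]
descending = all(a > b for a, b in zip(predicted, predicted[1:]))
```

All three residuals are within about one log unit, and both the experimental and the predicted values decrease as the chain gets longer.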
The 3rd and 4th OECD principles showed the most shortcomings. Regarding the 3rd OECD principle, only about half of the prepared QMRFs described the applicability domain of the developed models. This is very surprising, since every good QSAR/QSPR model should have a defined space of validity. Otherwise, the model could be applied to chemicals for which the predictions are unacceptably unreliable. Moreover, considering the 4th OECD principle, many of the collected models have not been correctly validated or have not provided all the required parameters. A variety of statistical validation techniques are available to assess the robustness and predictivity of models, and different parameters are now routinely used to express these aspects of model performance. They are the standards that developed QSAR models should follow. Another issue is the availability of the data with which the model was calibrated and validated. Providing the endpoint and descriptor values for the training and validation sets is required to reproduce the model and properly predict the values for new compounds. Although this is only an unwritten standard in QSAR/QSPR model building, more than half of the available models for PFASs contain this information.
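One widely used way to define a space of validity is the leverage approach, in which a query compound is flagged when its leverage exceeds a warning value of 3(p + 1)/n (p descriptors, n training compounds). The sketch below shows this for a single descriptor; it is an illustration of the general idea, not the method used by any specific reviewed model, and the training values and function names are hypothetical.

```python
from statistics import mean


def leverages(x_train, x_query):
    """Leverage of each query value for a one-descriptor model with intercept."""
    n = len(x_train)
    x_bar = mean(x_train)
    sxx = sum((x - x_bar) ** 2 for x in x_train)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in x_query]


def warning_leverage(n_train, n_descriptors=1):
    """Common warning threshold h* = 3 * (p + 1) / n."""
    return 3 * (n_descriptors + 1) / n_train


def in_domain(x_train, x_new):
    """True if the query compound's leverage does not exceed h*."""
    h = leverages(x_train, [x_new])[0]
    return h <= warning_leverage(len(x_train))


# Hypothetical training descriptor: main-chain carbon counts.
train_chain_lengths = [2, 4, 6, 8, 10, 12]

inside = in_domain(train_chain_lengths, 5)    # within the training range
outside = in_domain(train_chain_lengths, 20)  # far beyond the training range
```

With so few training compounds the threshold is permissive, which is one reason why applicability domain statements should always be reported together with the training set itself.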
Summing up, more than half (22) of the collected models are scientifically valid based on the OECD principles and are ready to be used to predict the properties of new compounds. The rest of the models can be used to gain knowledge about the studied phenomenon, but they cannot be used to register new compounds, e.g., derivatives of PFAS, which would not have a negative impact on the environment and human life, simultaneously maintaining the desired properties. The scientific validity of the QSAR/QSPR model is the condition sine qua non for regulatory acceptance for using such a model for the prediction of the new compounds.
The present study reveals two major issues in the available predictive models dedicated to PFAS. Firstly, existing models are of very limited help in characterizing or assessing the environmental fate and transport of PFAS. They do not focus on the physicochemical endpoints relevant to this field. Because PFASs are widely used in commercial and industrial products, they have been frequently detected in drinking water, surface water, and groundwater in both industrialized and developing countries. Their solubility in water (especially of short-chain PFASs) is high, they often persist through degradation and treatment, and the understanding of their degradation products and toxicity is limited. Therefore, an innovative integrated modeling approach to predict the transport pathways and fate of PFASs in different environmental compartments (e.g., soil, sediment, groundwater, and surface water) is needed. In silico predictive models (especially QSPR approaches) can be helpful here in generating inputs to the fate and transport models. However, the model endpoints should match the inputs required by the fate and transport models. For example, QSPR models are available for vapor pressure, whereas the Henry's law constant may be more appropriate and environmentally relevant. There is still a need to develop predictive, scientifically valid models for all partition coefficients (octanol–water, air–water, air–octanol), degradation rates (biodegradation, abiotic degradation, and photodegradation), soil toxicity, and the bioconcentration factor for the whole group of PFASs, so that the fate and transport of these groups can be modeled.
The other issue connected with the development of predictive models is the availability of experimental data for model training. Although PFASs are a very large group of compounds (more than 5000), experimental data for relevant environmental and toxicity endpoints are available only for a small group of them, which does not represent the different groups/classes of these compounds equally. In such a case, one solution is to develop local QSAR/QSPR models for a single class of compounds; another is to use quantum chemistry methods to simulate the values of physicochemical endpoints for model training. Of course, the second approach also requires some experimental data to validate/support the theoretical calculations. Moreover, experimental data are needed for the external validation of available models, to determine their predictivity in new chemical space. This is very important because the majority of available models are not dedicated only to PFAS compounds (the training sets also contain other pollutants); therefore, they cannot be expected to work properly for every PFAS. In fact, it should be verified for which groups of PFASs the available models work. This would require selecting a truly external set of compounds, i.e., several PFASs belonging to different groups, conducting experimental studies, and comparing the results with those obtained by applying the QSAR/QSPR models. In this way, it will be possible to show how predictive these models are and what their limitations are with respect to the type of compound in the external dataset.
Secondly, the QSAR models available in the literature focus mainly on acute toxicity (EC50 or LC50), and they do not indicate a clear relationship between the structure of the PFAS and the adverse outcome pathway (AOP) or molecular initiating events (MIE). In fact, they only explain the basic structure/activity relationship (a statistical approach) but do not indicate the real mechanism of action. Recent studies69 indicate that exposure to PFASs may have a negative impact on all components of metabolic syndrome. This was shown not only for individual PFASs but also for mixtures of these compounds. Taking these studies into account, an appropriate method of modeling mixtures should be developed. Such steps have been taken in the newly established Partnership for the Assessment of Risk from Chemicals (PARC), an innovative research program to support EU and national institutions involved in chemical risk assessment and risk management. All the above-mentioned issues should be considered in the further development of predictive models so that they are valuable and applicable in the human and risk assessment of these compounds.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2gc04341f |
This journal is © The Royal Society of Chemistry 2023 |