Open Access Article
Paulina Alulema-Pullupaxi (a, c), Fatih Evrendilek (b, c), Dilara Hatinoglu (a, c), Simin Moavenzadeh Ghaznavi (c), Kenneth Mensah (a, c), Manisha Choudhary (c, e), Sonora Ortiz (d) and Onur Apul* (a, c)
a Department of Civil and Environmental Engineering, Pennsylvania State University, University Park, PA 16802, USA. E-mail: oga5061@psu.edu
b University of Maine Cooperative Extension, Orono, ME 04469, USA
c Department of Civil and Environmental Engineering, University of Maine, Orono, ME 04469, USA
d Department of Ecology and Environmental Sciences, University of Maine, Orono, ME 04469, USA
e Department of Biological and Agricultural Engineering, Kansas State University, Manhattan, KS 66506, USA
First published on 30th March 2026
This study presents a modeling approach to predict the soil-water partitioning coefficient (Kd, L kg−1) for per- and poly-fluoroalkyl substances (PFASs) as a function of their molecular connectivity indices (MCIs) and soil properties (soil organic carbon, SOC, %, and cation exchange capacity, CEC, cmol kg−1). The modeling framework involved compiling data, developing models, and evaluating model performance via interpretation, external validation, and scenario analyses. Two datasets consisting of simple and valence MCIs per PFAS were used: (i) the carboxylic-PFAS dataset (N = 327), which contained only carboxylic acids (C4–C12), and (ii) the PFAS-full dataset (N = 699), which comprised carboxylic acids (C4–C12), sulfonic acids (C4–C10) and fluorotelomer sulfonates (C4–C8). Our multi-criteria approach revealed that the seventh-order valence path (VP-7), related to polarizability and molecular size, and the third-order simple path (SP-3), related to molecular size and chain structure, emerged as key predictors for the carboxylic-PFAS and PFAS-full datasets, respectively. Elastic net-regularized linear regression (MLREN) and artificial neural networks (ANNs) demonstrated that MCIs improved the predictive accuracy. For the PFAS-full dataset, six-predictor models (MCIs + soil properties) yielded a high predictive accuracy (Rpred2 = 83.7–84.9%); however, a three-predictor MLREN model (SP-3, SOC, and CEC; Rpred2 = 77.9%) achieved the highest external generalization (Rext2 = 52.4%). SP-3 accounted for the largest share of predictive power (68–95%), dominating the model performance (94–97%). Scenario analyses revealed that while deterministic predictions remained stable, probabilistic modeling is crucial for capturing the rare but impactful extremes. Overall, our study highlights the practical advantage of MCIs as versatile and scalable tools for predicting the adsorption of diverse PFAS, including short-chain, partially fluorinated, and less commonly studied PFASs. In the long term, this tool can provide data for preliminary, rapid, site-specific risk assessment for PFAS-impacted sites.
Environmental significance
Per- and poly-fluoroalkyl substances (PFAS) are persistent, mobile, and toxic environmental contaminants that threaten soil and groundwater resources. Understanding and predicting how PFAS interact with soils is critical for managing their risks. However, laboratory testing of PFAS sorption is time- and resource-intensive, particularly for emerging or understudied compounds. This study demonstrates a novel and scalable modeling framework that integrates molecular connectivity indices (MCIs) with soil properties to predict the PFAS sorption behavior. By capturing their key molecular features such as size and branching, this model enables rapid estimation of PFAS mobility across a wide range of compounds and soils. This approach can serve as a screening tool for site-specific risk assessment and help prioritize remediation efforts. The integration of deterministic and probabilistic analyses also enhances environmental decision-making by identifying potential high-risk scenarios.
Existing predictive models such as quantitative structure–activity relationship (QSAR), quantitative structure–property relationship (QSPR), and linear-solvation energy relationship (LSER) have been reported to predict parameters that govern the PFAS environmental behavior.13–17 These models typically use physicochemical descriptors, molecular properties (i.e., molar volume, molecular weight, fluorine number, carbon number, and carbon number in tail), and solvatochromic Abraham descriptors as explanatory variables and primarily target long-chain PFAS (either with carboxylic or sulfonic functional groups). However, these models often fail to represent the broader chemical diversity of PFAS, particularly emerging short-chain compounds, branched compounds or partially fluorinated precursors like fluorotelomer sulfonates (FTS), which can degrade into short-chain regulated compounds.18,19 In particular, LSER models, although mechanistically informative, are limited to neutral compounds, which restricts their use under diverse environmental conditions for ionizable PFAS. For this, our previous study has investigated the adjustment of solvatochromic predictors for carboxylic PFAS.13,20 To advance this approach, in accordance with the rapidly developing PFAS literature, we now explore molecular connectivity indices (MCIs) as a versatile and chemically inclusive alternative for a comprehensive predictive tool.
MCIs are topological descriptors derived from the molecular graph, quantifying aspects of molecular size, branching, and derived electronic properties based on atom connectivity.21 Unlike descriptors requiring specific functional group information or 3D conformation, MCIs encode structural information inherent in the bonding topology, potentially offering broader applicability across diverse PFAS structures, including those lacking traditional functional groups or with complex branching.22 These descriptors have proven successful in predicting octanol–water (Kow) and octanol–air (Koa) partition coefficients,23,24 bioconcentration ratios,25,26 and distribution coefficients for the sorption of aromatic compounds by carbon-based adsorbents and soils,27–29 where zero-, first-, and third-order simple and valence indices were found to be the best topological predictors.23,25–28 However, their systematic application and evaluation for modeling the soil–water partitioning of PFAS (Kd) remain unexplored, representing a critical knowledge gap.
Therefore, our study is the first systematic development and validation of MCIs for predicting PFAS sorption in soils in combination with soil physicochemical attributes. To achieve this goal, the research was designed with four specific objectives. First, we provide a baseline for validating the MCI framework within a chemically homogeneous subset by comparing the predictive performance of MCIs with that of Abraham solvatochromic descriptors in predicting log Kd of perfluoroalkyl carboxylic acids (PFCAs) before extending the analysis to a broad, structurally diverse set of PFAS. Second, we develop and evaluate the predictive and generalization abilities of linear and machine-learning models to predict log Kd for multiple PFAS subclasses (i.e., perfluoroalkyl carboxylic acids – PFCAs, perfluoroalkyl sulfonic acids – PFSAs, and fluorotelomer sulfonates – FTS). Third, we characterize the predictor importance and interaction effects of molecular and soil predictors using Monte Carlo simulations to elucidate key factors influencing PFAS sorption. Finally, we conduct scenario analyses to identify the conditions that maximize PFAS sorption (log Kd) via a composite desirability function (D) and evaluate model uncertainty and robustness under boundary minima via Monte Carlo simulations.
Kd) of the group of PFAS included in each dataset, while explanatory variables included both PFAS molecular descriptors (i.e., molecular connectivity indices (MCIs) and Abraham descriptors) and soil physicochemical properties (i.e., soil organic carbon – SOC, %; cation exchange capacity – CEC, cmol kg−1; and soil pH).
The first dataset (N = 327), called the carboxylic-PFAS dataset, contained nine carboxylic PFAS (C4–C12) and was used to evaluate the predictive capacity of MCIs by comparing them against Abraham descriptors. The second dataset (N = 699), called the PFAS-full dataset, extended the chemical coverage to include 9 perfluoroalkyl carboxylic acids – PFCAs (C4–C12), 7 perfluoroalkyl sulfonic acids – PFSAs (C4–C10), and 3 fluorotelomer sulfonates – FTS (C4–C8), and was used for developing and comparing MCI-based predictive models for PFAS. Both datasets were used solely for model development and internal validation, each randomly split into 75% training and 25% validation subsets.
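The random 75%/25% partition described above can be sketched as follows. This is a minimal illustration only: the function name, the seed, and the use of integer row identifiers are assumptions, and the study itself performed the split in JMP.

```python
import random


def split_train_validation(rows, train_fraction=0.75, seed=0):
    """Randomly partition rows into training and validation subsets.

    `seed` is an illustrative choice that makes the split repeatable;
    it is not a value reported in the study.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Example with 699 row identifiers, the size of the PFAS-full dataset.
train, validation = split_train_validation(range(699))
print(len(train), len(validation))
```

For 699 records, this rounding yields 524 training and 175 validation rows.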
Molecular descriptor calculation began with retrieving the simplified molecular input line entry system (SMILES) strings for each PFAS molecule in both neutral and ionic forms from the PubChem database.30 Using the ChemDes platform (https://www.scbdd.com/chemdes) and the PaDEL Descriptor Calculator,31 SMILES were transformed into 46 MCIs for the acidic and ionic forms of each compound. This study included the following two classes: (1) simple indices that encode sigma-bonding patterns from the structural formula and (2) valence indices that incorporate sigma, pi, and lone-pair electrons, thus providing more detailed electronic information. Index order (n = 0–7, zeroth to seventh order) and fragment type (e.g., simple or valence path, SP or VP; simple or valence cluster, SC or VC) define the structural scope of each MCI, as described in Text S1. Fig. 2 illustrates the derivation and structural interpretation of MCIs, using perfluorobutanoic acid (PFBA) as an example. These descriptors capture how atomic connectivity influences molecular behavior, providing a structural basis for linking PFAS chemistry to sorption processes in soils. On the other hand, Abraham descriptors were calculated for both neutral and ionic forms following the methodology reported by Hatinoglu et al. (2023),13 with a total of five descriptors for the neutral form and six descriptors for the ionic form (′) of each compound: excess molar refraction (E, E′), dipolarity/polarizability (S, S′), hydrogen bond acidity (A, A′), hydrogen bond basicity (B, B′), and molar volume (V, V′), and an additional descriptor (J−) to distinguish ionized from neutral species (Table S4). These descriptors are widely used in LSER models to quantify solute–solvent interactions. Both sets of descriptors (MCIs and Abraham descriptors) were corrected to reflect PFAS dissociation states under experimental conditions, as summarized in Text S1.
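To make the idea of a simple path index concrete, the sketch below computes the standard Kier–Hall simple path index of a given order on the hydrogen-suppressed graph of PFBA: each path with n bonds contributes the inverse square root of the product of its atoms' heavy-atom degrees. The atom labels and function names are illustrative assumptions. The code does not reproduce the PaDEL or ChemDes implementations, omits valence, cluster, and averaged indices, and does not apply the speciation corrections of Text S1.

```python
from math import sqrt

# Hydrogen-suppressed graph of PFBA (CF3-CF2-CF2-COOH); labels are illustrative.
# The C=O double bond counts as a single edge for simple (non-valence) indices.
BONDS = [
    ("C1", "F1"), ("C1", "F2"), ("C1", "F3"), ("C1", "C2"),
    ("C2", "F4"), ("C2", "F5"), ("C2", "C3"),
    ("C3", "F6"), ("C3", "F7"), ("C3", "C4"),
    ("C4", "O1"), ("C4", "O2"),
]


def build_adjacency(bonds):
    """Map each atom to the set of its bonded heavy-atom neighbours."""
    adj = {}
    for a, b in bonds:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def enumerate_paths(adj, order):
    """Return every simple path containing `order` bonds, each counted once."""
    found = set()

    def extend(path):
        if len(path) == order + 1:
            # Keep one orientation so a path and its reverse coincide.
            found.add(min(tuple(path), tuple(reversed(path))))
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:
                extend(path + [nxt])

    for start in adj:
        extend([start])
    return found


def simple_path_index(bonds, order):
    """Sum (product of simple degrees)^(-1/2) over all paths of `order`."""
    adj = build_adjacency(bonds)
    degree = {atom: len(neighbours) for atom, neighbours in adj.items()}
    total = 0.0
    for path in enumerate_paths(adj, order):
        product = 1
        for atom in path:
            product *= degree[atom]
        total += 1.0 / sqrt(product)
    return total


for n in range(4):
    print(n, round(simple_path_index(BONDS, n), 4))
```

A valence index would follow the same path enumeration but replace the simple degree with the Kier–Hall valence delta of each atom.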
The experimental Kd values and corresponding soil properties were compiled from the literature, specifically from peer-reviewed articles reporting experimentally derived sorption isotherms, following the methodology reported in our recent publication.32 The sorption isotherm slope, Kd, was derived from linear fits, with consideration of test conditions such as SOC (%), CEC (cmol kg−1), pH, and the structural characteristics of PFAS compounds, including short and long C-chains and functional groups. When adsorption was non-linear and Kd was not directly reported, Kd was derived from the initial linear (Henry) region of the isotherm, corresponding to Ce ≈ 10−5 mg L−1, where sorption is proportional to solute concentration and independent of site saturation. The concentration of a compound adsorbed to the soil matrix (Cs, mg kg−1) and its concentration in the aqueous phase (Cw, mg L−1) from this region were used to calculate Kd. For datasets reporting only Koc (the organic carbon-normalized distribution coefficient), Kd was back-calculated using the reported foc (the fraction of the solid that is organic carbon). All values were standardized to mg kg−1 and mg L−1 for consistency.32 The compiled datasets encompassed a wide range of soil matrices, including uncontaminated reference soils,9,33–43 AFFF-contaminated soils,11,44–46 and pure clay minerals47,48 (e.g., montmorillonite, kaolinite, and illite), collected across the U.S., Canada, Sweden, China, and South Africa, capturing global variability in soil composition and contamination sources (Table S2).
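The two routes to Kd described above reduce to Kd = Cs/Cw in the linear region and Kd = Koc × foc when only Koc is reported. The sketch below encodes both; the function names and the numerical inputs are hypothetical and serve only to show the unit handling (foc as a mass fraction, not a percentage).

```python
from math import log10


def kd_from_concentrations(cs_mg_per_kg, cw_mg_per_l):
    """Kd (L/kg) as sorbed over aqueous concentration in the linear region."""
    if cw_mg_per_l <= 0:
        raise ValueError("aqueous concentration must be positive")
    return cs_mg_per_kg / cw_mg_per_l


def kd_from_koc(koc_l_per_kg, foc):
    """Back-calculate Kd from Koc; foc is a mass fraction between 0 and 1."""
    if not 0.0 <= foc <= 1.0:
        raise ValueError("foc must be a fraction, e.g. 0.02 for 2% organic carbon")
    return koc_l_per_kg * foc


# Hypothetical inputs, not values from the compiled datasets.
kd_direct = kd_from_concentrations(cs_mg_per_kg=5.0, cw_mg_per_l=2.0)
kd_back = kd_from_koc(koc_l_per_kg=1000.0, foc=0.02)
print(kd_direct, log10(kd_direct))
print(kd_back, log10(kd_back))
```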
The third dataset (N = 658), called the independent external dataset, was partially compiled from a recently published study49 and contained log Kd (L kg−1), SOC (%), and CEC (cmol kg−1) data. To evaluate model generalization, it was subdivided into soil-only and combined soil & sediments groups, as presented in Fig. 3. The soil-only subset consisted of 31 PFAS across multiple types: 11 PFCAs (C4–C14), 6 PFSAs (C4–C10), 2 FTS (C4–C6), 4 fluorotelomer alcohols – FTOHs (C4–C10), 2 zwitterionic PFAS, and 3 perfluoro phosphonic acids – PFPAs (C6–C10), plus 8:2 chlorinated polyfluoroalkyl ether sulfonic acid – 8:2 Cl-PFESA, trifluoroacetic acid – TFA, and perfluorooctaneamido ammonium – PFOAAmS. The soil & sediments subset included the same 31 PFAS types plus four additional compounds (i.e., N-methyl perfluorooctanesulfonamido acetic acid – N-MeFOSAA, perfluorodecane sulfonic acid – PFDS, N-ethyl perfluorooctanesulfonamido acetic acid – N-EtFOSAA, and perfluorooctane sulfonamide – PFOSA). Each subset was used exclusively for external validation to assess generalization capability across environmental conditions, ensuring full independence from training data and preventing data leakage. No normalization, scaling, or transformation was applied prior to modeling to preserve original data distributions.
Fig. 3 PFAS subclasses included in the independent external dataset subdivided by their environmental matrix type: (a) soil-only and (b) combined soil and sediment samples.
Kd was developed following three integrated stages: (i) exploratory data analysis; (ii) predictor and model screening; and (iii) internal model predictive performance using the carboxylic-PFAS and PFAS-full datasets. The first stage included the calculation of descriptive statistics (Tables S5, S6 and Fig. S1–S3) and the identification of potential outliers that could distort observed PFAS–soil relationships. Outlier detection followed a consistent procedure across the carboxylic-PFAS and PFAS-full datasets using a statistical criterion based on the interquartile range (IQR).50 Specifically, data points lying outside the range defined by the 10th and 90th percentiles ±3 × IQR were identified and removed. The outliers removed comprised one CEC value from the carboxylic-PFAS dataset and 51 SOC and two CEC values from the PFAS-full dataset. For the independent external dataset, we applied the same outlier detection to ensure consistency with our preprocessing pipeline, even though the original source may have applied its own criteria. A total of 30 SOC outliers were removed from this dataset.
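The outlier rule stated above can be sketched as a simple fence test: a value is flagged when it falls below the 10th percentile minus 3 × IQR or above the 90th percentile plus 3 × IQR. The percentile interpolation scheme and the choice of the 25th to 75th percentile spread for the IQR are assumptions of this sketch; statistical packages may define the quantile range and interpolation differently, so flagged counts need not match those reported.

```python
def percentile(sorted_values, p):
    """Linearly interpolated percentile (p in 0-100) of a pre-sorted list."""
    k = (len(sorted_values) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (k - lo)


def flag_outliers(values, multiplier=3.0):
    """Return one boolean per value: True if outside the percentile fences."""
    ordered = sorted(values)
    # Assumption: IQR taken as the 75th minus the 25th percentile.
    iqr = percentile(ordered, 75) - percentile(ordered, 25)
    lower = percentile(ordered, 10) - multiplier * iqr
    upper = percentile(ordered, 90) + multiplier * iqr
    return [v < lower or v > upper for v in values]


# Hypothetical SOC-like values with one extreme entry.
soc = [float(i) for i in range(1, 21)] + [200.0]
flags = flag_outliers(soc)
print([v for v, is_out in zip(soc, flags) if is_out])
```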
The second stage focused on identifying the most relevant MCIs and soil predictors while preventing model overfitting. A non-formal feature screening process was adopted, integrating multiple-criteria analyses to balance statistical robustness with environmental interpretability. Non-linear importance was assessed using the Random Forest (RF) algorithm to evaluate the relative contribution of each molecular connectivity index to log Kd.50 Linear relationships were examined through Pearson's correlation coefficient (r) to evaluate both the strength of association between each predictor and log Kd and the degree of multicollinearity among predictors.51 Joint predictive power was preliminarily evaluated using elastic net-regularized multiple linear regression (MLREN; α = 0.99; number of grid points = 150; grid scale = square root; and minimum penalty fraction = 0.001) to assess the combined predictive contribution of individual MCIs when integrated with soil properties.52 Multicollinearity was controlled by retaining only predictors with a variance inflation factor (VIF) ≤ 12. The final subset was selected based on a multi-criteria approach combining RF importance, strong correlation with log Kd, strong predictive performance in preliminary MLREN models, and low multicollinearity (VIF ≤ 12), reflecting variables that were both statistically stable and mechanistically meaningful in describing PFAS–soil interactions. This exploratory process prioritized interpretability and reproducibility over exhaustive optimization; hence, no hyperparameter tuning was conducted. The retained predictors are summarized in Table S7.
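The multicollinearity screen above relies on the VIF, which for predictor j equals 1/(1 − Rj2), where Rj2 is obtained by regressing predictor j on the remaining predictors. An equivalent route, used in the sketch below, is to read the VIFs from the diagonal of the inverse of the predictor correlation matrix. The function names and the toy columns are hypothetical; the study's own values were produced in JMP.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF per predictor: diagonal of the inverse correlation matrix."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}


# Hypothetical predictor columns; retain those with VIF <= 12.
data = {"x1": [1, 2, 3, 4, 5], "x2": [2, 1, 4, 3, 5]}
scores = vif(data)
print({name: round(v, 3) for name, v in scores.items() if v <= 12})
```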
The third stage involved the model training and validation phase. Both the carboxylic-PFAS and PFAS-full datasets were randomly partitioned into 75% training and 25% validation subsets to ensure the balanced representation of compound types. The carboxylic-PFAS dataset was analyzed separately to provide a chemically controlled baseline, enabling direct comparison between the new MCI framework and established Abraham solvatochromic descriptors that have only been developed and validated for carboxylic PFAS. This comparison validated the MCI approach within a homogeneous compound group before extending it to the more structurally diverse PFAS-full dataset, which includes carboxylic acids, sulfonic acids, and fluorotelomer sulfonates. Subsequently, linear and machine learning models were developed using the PFAS-full dataset to assess the predictive capabilities of MCIs across compound classes. Simple linear regression (SLR) models were trained using selected MCIs (i.e., VP-1, VP-4, VC-3, VC-4, VC-6, VPC-4, SP-3, SP-4, and SPC-4), as described in Table S9. MLREN was subsequently fitted using either the same predictors as SLR or an extended set including ASP-1, ASP-0, AVP-0 and soil properties, as shown in Table S10a. Additional machine learning algorithms, including artificial neural networks (ANN), random forest (RF), support vector machine (SVM), extreme gradient boosting (XGB), and K nearest neighbors (KNN), were trained using multiple MCIs (i.e., SP-3, ASP-1, AVP-0, and ASP-0) and soil properties (Table S10a). All models were implemented using the default hyperparameter settings of their respective libraries, and no hyperparameter optimization or cross-validation was performed at this stage to ensure transparency and reproducibility.
Model performance was assessed across multiple metrics to capture goodness-of-fit, predictive accuracy, and generalization capacity. For linear models (SLR and MLREN), training fit was evaluated using R2 (SLR) or adjusted R2 (Radj2; MLREN), while machine learning models (ANN, RF, SVM, XGB, and KNN) were assessed using R2 on training data. Predictive accuracy was quantified using Rpred2 on validation subsets for ANN, SLR, and MLREN. Lastly, generalization capacity on external data was tested using Rext2 for ANN and MLREN. Additionally, the root mean square error (RMSE) was calculated for all models to provide an absolute measure of prediction error, and the corrected Akaike information criterion (AICc) was applied exclusively for model selection among linear models (SLR and MLREN) rather than as a performance metric. Detailed results are provided in Table S10b.53
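The metrics named above can be written compactly. The sketch below uses the textbook definitions: R2 = 1 − SSE/SST, adjusted R2 penalized for p predictors, RMSE with n in the denominator, and a least-squares form of AICc with k estimated parameters. These conventions are assumptions; software packages differ in how they count parameters and in whether RMSE uses n or the residual degrees of freedom, so values may not match Table S10b exactly.

```python
from math import log, sqrt


def sse(observed, predicted):
    """Sum of squared errors."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


def r_squared(observed, predicted):
    """Coefficient of determination; on held-out data this gives Rpred2 or Rext2."""
    mean_obs = sum(observed) / len(observed)
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse(observed, predicted) / sst


def adjusted_r_squared(r2, n, p):
    """Adjust R2 for the number of predictors p (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def rmse(observed, predicted):
    """Root mean square error with n in the denominator (assumed convention)."""
    return sqrt(sse(observed, predicted) / len(observed))


def aicc(observed, predicted, k):
    """Least-squares AICc, up to an additive constant; k = estimated parameters."""
    n = len(observed)
    aic = n * log(sse(observed, predicted) / n) + 2 * k
    return aic + 2 * k * (k + 1) / (n - k - 1)


# Hypothetical observed and predicted log Kd values.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(obs, pred)
print(round(r2, 3), round(adjusted_r_squared(r2, len(obs), 1), 3), round(rmse(obs, pred), 3))
```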
To evaluate the model sensitivity and quantify uncertainty under realistic environmental variability, Monte Carlo simulations (N = 5000) were performed by randomly sampling predictors from their empirical distributions (exponential for SOC, gamma for CEC, and Johnson Sb for SP-3) derived from the PFAS training/validation dataset (Table S12). Simple random sampling was applied without explicitly preserving correlations between predictors, as the primary objective was to quantify the effect of univariate variability or main effects. Specifically, main effects represent the independent contribution of each predictor, assessed by varying predictors individually, while interaction effects were assessed by varying multiple predictors simultaneously to capture interdependencies. Averaging results across iterations provided estimates of both independent (main) and joint (interaction) influences on predicted log Kd.55,56
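A minimal sketch of this kind of simulation is given below. It draws SOC from an exponential distribution, CEC from a gamma distribution, and SP-3 from a Johnson SB distribution (generated by transforming a standard normal draw), then compares the variance of predictions when one predictor varies alone with the variance when all vary together. Every number in the sketch is hypothetical: the distribution parameters are not those of Table S12, the linear coefficients are not the fitted M7 model, and the variance-share summary is only one of several ways to express a main effect.

```python
import random
from math import exp
from statistics import mean, pvariance

# Hypothetical coefficients for log Kd = b0 + b1*SP3 + b2*SOC + b3*CEC.
COEF = {"intercept": -2.0, "SP3": 0.5, "SOC": 0.1, "CEC": 0.01}


def predict_log_kd(SP3, SOC, CEC):
    """Evaluate the illustrative linear model."""
    return (COEF["intercept"] + COEF["SP3"] * SP3
            + COEF["SOC"] * SOC + COEF["CEC"] * CEC)


def johnson_sb(rng, gamma, delta, xi, lam):
    """Johnson SB draw on (xi, xi + lam) via a transformed standard normal."""
    z = rng.gauss(0.0, 1.0)
    return xi + lam / (1.0 + exp(-(z - gamma) / delta))


def draw(rng):
    """One independent draw of the three predictors (assumed parameters)."""
    return {
        "SP3": johnson_sb(rng, gamma=0.0, delta=1.0, xi=2.0, lam=6.0),
        "SOC": rng.expovariate(1.0 / 2.0),   # mean of 2 %
        "CEC": rng.gammavariate(2.0, 8.0),   # shape 2, scale 8
    }


def main_effect_shares(n=5000, seed=1):
    """Variance share of each predictor when it alone is varied."""
    rng = random.Random(seed)
    draws = [draw(rng) for _ in range(n)]
    baseline = {k: mean(d[k] for d in draws) for k in ("SP3", "SOC", "CEC")}
    total = pvariance([predict_log_kd(**d) for d in draws])
    shares = {}
    for name in baseline:
        # Hold the other predictors at their simulated means.
        one_at_a_time = [predict_log_kd(**{**baseline, name: d[name]})
                         for d in draws]
        shares[name] = pvariance(one_at_a_time) / total
    return shares


print({k: round(v, 3) for k, v in main_effect_shares().items()})
```

Because the illustrative model is additive and the draws are independent, the shares sum to approximately one; interaction terms or correlated sampling would break that property.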
Additionally, model robustness and optimization were examined through targeted scenario analyses to evaluate model stability under extreme or adverse conditions and to explore optimization strategies. Robustness was tested by simulating boundary-minima predictor values and introducing random noise to predictions, while optimization of predicted log Kd was performed using a composite desirability function (D; 0 ≤ D ≤ 1), with D = 1 representing ideal performance.57 This approach identified the optimal combination of predictor values (within the observed ranges) maximizing PFAS sorption (log Kd). JMP 18.2 (JMPSD LLC, Cary, NC, USA) was used for data analysis and modeling.
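The desirability-based optimization can be illustrated with a larger-is-better desirability of the Derringer–Suich type, which maps a predicted response onto [0, 1], and a composite D taken as the geometric mean of the individual desirabilities (a single response here). The sketch searches a coarse grid of predictor settings for the maximum D. The linear model, the predictor ranges, the desirability limits, and the grid resolution are all hypothetical assumptions and do not reproduce the JMP profiler used in the study.

```python
from itertools import product


def desirability_max(y, low, high, weight=1.0):
    """Larger-is-better desirability: 0 at or below low, 1 at or above high."""
    if y <= low:
        return 0.0
    if y >= high:
        return 1.0
    return ((y - low) / (high - low)) ** weight


def composite(desirabilities):
    """Composite D as the geometric mean of individual desirabilities."""
    result = 1.0
    for d in desirabilities:
        result *= d
    return result ** (1.0 / len(desirabilities))


def grid(low, high, steps):
    """Evenly spaced values from low to high inclusive."""
    return [low + (high - low) * i / (steps - 1) for i in range(steps)]


def optimize(predict, ranges, low, high, steps=11):
    """Return (D, setting) for the grid point with the highest D."""
    names = list(ranges)
    best = None
    for combo in product(*(grid(*ranges[n], steps) for n in names)):
        setting = dict(zip(names, combo))
        d = composite([desirability_max(predict(**setting), low, high)])
        if best is None or d > best[0]:
            best = (d, setting)
    return best


def toy_model(SP3, SOC, CEC):
    """Hypothetical linear model for log Kd, not the fitted model."""
    return -2.0 + 0.5 * SP3 + 0.1 * SOC + 0.01 * CEC


# Hypothetical predictor ranges and desirability limits.
ranges = {"SP3": (2.0, 8.0), "SOC": (0.0, 10.0), "CEC": (0.0, 50.0)}
best_d, best_setting = optimize(toy_model, ranges, low=-1.37, high=4.0)
print(round(best_d, 3), best_setting)
```

With a purely linear model and positive coefficients, the optimum necessarily sits at the upper corner of the ranges; the grid search becomes informative only when the model is non-linear or when several responses are combined.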
Kd values for PFAS). The characterization of the training datasets used in this study reveals substantial variability in log Kd behavior among PFAS subclasses, strongly influenced by molecular structure and functional groups. The distribution of log Kd values for the carboxylic-PFAS dataset (N = 327) was best described by a two-component normal mixture (N2M), which provided the lowest Bayesian information criterion (BIC), indicating the presence of two subpopulations with distinct sorption behaviors, primarily reflecting the differences between short- and long-chain carboxylic PFAS. In contrast, for the PFAS-full dataset (N = 699), a three-component normal mixture (N3M) achieved the lowest BIC, consistent with a broader diversity of compounds that exhibit varying sorption mechanisms. This statistical outcome suggests that the PFAS cannot be represented by a single distribution but rather form multiple subpopulations with distinct sorption mechanisms, driven by variations in chain length, degree of fluorination, and functional group chemistry. Across both datasets, log Kd values spanned a broad range (−1.37 to 3.33 L kg−1) with high variability (CV = 200% and 205%), highlighting the diversity of environmental behaviors among compounds. The right-skewness of the distributions (mean = 0.49 > median = 0.34) suggests that low-sorption compounds are more prevalent, though a smaller subset exhibits markedly higher sorption. The independent external dataset (N = 628) displayed similar patterns, and log Kd values also followed a three-component normal mixture (SD = 0.81) across a similar range (−1.15 to 3.57 L kg−1) but with lower variability (CV = 92%), as shown in Tables S5 and S6 and Fig. S1–S3. Collectively, these results reveal that the PFAS sorption behavior is not uniform but structurally dependent, supporting the need for models that can capture this molecular and environmental complexity.
Fig. 4a illustrates compound-specific log Kd variability within each functional group (i.e., PFCAs, PFSAs, and FTS) and the relationship between C-chain and log Kow (n-octanol–water partition coefficient). This figure underscores the pronounced variability not only between subclasses but also among individual PFAS within the same class. For example, PFBA, PFPeA, and PFHxA (C4–C6 carboxylic acids) show narrower distributions with lower median log Kd values, while longer-chain PFCAs like PFOA and PFUnDA (C8–C11) and PFSAs like PFOS and PFDS (C8–C10) exhibit both higher and more dispersed log Kd distributions. Similarly, FTS like 6:2 and 8:2 FTS display wide distributions. These patterns align with previous observations and indicate that molecular features such as chain length, branching, and functional group chemistry, together with soil property heterogeneity, are critical determinants of sorption behavior.9,49,58
Additionally, a clear positive trend was observed: log Kd increased with the perfluorinated chain length and consequently molecular weight, consistent with the increase in hydrophobicity and stronger sorption affinity to soil. Short-chain PFCAs (C4–C6, MW ≈ 200–400 g mol−1) exhibited narrow log Kd distributions and low median values (median log Kd ≤ 0 L kg−1), attributed to weak hydrophobic interactions and limited retention by soil.9 In contrast, long-chain PFCAs (C7–C12, MW ≈ 500–600 g mol−1) and longer-chain PFSAs and FTS demonstrated significantly higher and more variable log Kd values (median log Kd > 2.0 L kg−1). However, PFSAs consistently showed higher log Kd values than PFCAs of similar chain lengths, highlighting a stronger affinity of soil for sulfonic acid groups than for carboxylic groups, aligning with previous findings.35,47 Similarly, FTS generally followed the hydrophobicity trend dictated by the chain length, with 8:2 FTS exhibiting high log Kd consistent with its relatively high log Kow. In summary, the shorter chain length resulted in weaker attraction to soil, leading to more consistent log Kd values. The broad data dispersion among the long-chain PFAS reflects greater intermolecular disparity, while variations within individual PFAS classes underscore differences in soil partitioning behaviors influenced by diverse soil properties, as reported in experimental conditions in the related literature.
The soil pH ranged from 3.5 to 8.0 (median pH = ∼7), consistent with the conditions spanning acidic to neutral soils. However, the soil pH was not included as a predictor since it was used to account for PFAS speciation (i.e., ionization when the pH exceeds their pKa) for adjusting the molecular predictors (Text S1). Since PFAS are predominantly anionic, the soil pH influences PFAS speciation.39,59 This affects both electrostatic interactions and the soil's surface charge because acidic soils reduce electrostatic repulsion and may enhance the sorption of anionic PFAS, while alkaline soils increase repulsion and may reduce sorption.35
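The dependence of speciation on pH follows the Henderson–Hasselbalch relation for a monoprotic acid: the anionic fraction is 1/(1 + 10^(pKa − pH)). The sketch below evaluates it over the soil pH range reported above. The pKa value is a hypothetical placeholder, since PFAS pKa estimates vary widely in the literature, and the sketch does not reproduce the descriptor adjustment of Text S1.

```python
def anionic_fraction(ph, pka):
    """Fraction of a monoprotic acid present as the anion at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


# Hypothetical pKa; the reported soil pH range was 3.5 to 8.0.
ASSUMED_PKA = 0.5
for ph in (3.5, 5.0, 7.0, 8.0):
    print(ph, round(anionic_fraction(ph, ASSUMED_PKA), 6))
```

For any acid whose pKa lies several units below the soil pH, the anionic fraction is close to one across the whole range, which is consistent with treating PFAS as predominantly anionic.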
Fig. 4b–e and S2 show the adjusted predictors (i.e., MCIs and Abraham predictors) that were used in this study. Fig. 4b–e show the distribution of 46 MCIs, which capture structural variability among PFAS. The descriptors were grouped by type: Simple Cluster (SC), Simple Path (SP), Valence Cluster (VC), Valence Path (VP), Simple and Valence Path-Cluster (SPC and VPC), and Average Simple and Valence Path (ASP and AVP). Substantial variability was observed in lower-order indices (e.g., n = 1–3), particularly for simple and valence path descriptors, reflecting their sensitivity to differences in the molecular structure. In contrast, higher-order indices (n ≥ 4) exhibited narrower ranges, suggesting limited ability to differentiate among compounds. Pearson's correlation matrix analysis of the 46 MCIs and two soil properties revealed the following four main clusters of internally correlated predictors: (1) simple and valence clusters (SC/VC), simple and valence path clusters (SPC/VPC), and simple and valence paths (SP/VP) (r > 0.80 and p < 0.05); (2) average simple path (ASP-0 to ASP-5) (r > 0.60 and p < 0.05); (3) average valence path (AVP) (r > 0.80 and p < 0.05) and ASP-6/7 (r = 0.50 and p < 0.05); and (4) soil properties (r ≤ 0.50 and some p-values > 0.05) (Fig. S4). Cluster 1 was inversely associated with Cluster 2 (r ≤ −0.60 and p < 0.05), while Clusters 3 and 4 showed weaker positive relationships with the others (r < 0.60, with p < 0.05 and p ≤ 0.9, respectively), indicating distinct predictor groups. This analysis provided a rationale for MCI-based predictor selection based on the structural heterogeneity of PFAS in the dataset, prioritizing those combining high variability and moderate complexity for model development.
The Abraham descriptors, calculated for carboxylic PFAS only, also exhibited distinct ranges of variability (Fig. S2). Among these, S′ exhibited the widest variability, highlighting substantial differences in the dipolarity/polarizability of PFAS molecules, a property linked to non-specific polar interactions with soil surfaces. E′ and V′ exhibited moderate ranges, suggesting that excess molar refraction and molecular volume also vary significantly among carboxylic PFAS. Since all PFCAs have the same functional group (i.e., –COOH) in their structure, A′, B′, and J− values are similar across the database. These differences in descriptor dispersion are critical for variable selection, as high-variability predictors, such as S′, E′, and V′, are more likely to capture meaningful differences in PFAS sorption behavior.13
Kd. Between-group multicollinearity was managed by permitting combinations only if inter-group |r| was below 0.6.
The improved performance of the MCI model is likely attributed to the ability of VP-7 to better capture structural and electronic features influencing PFAS sorption, such as molecular size and polarizability, than the van der Waals interactions alone represented by E′ in Abraham descriptors.13 VP-7 is a seventh-order valence path index that quantifies the connectivity of 8 atoms in a linear path and includes electronic information (Table S3).22 This predictor turned out to be the most important of the 46 MCIs tested, indicating that PFAS sorption was influenced not only by the chain length (i.e., longer C-chains → higher hydrophobicity → higher log Kd; Fig. S7) but also by the –COOH group, which has a partial negative charge and a dipole moment that facilitates hydrogen bonding.22 In other words, we consider that VP-7 spans from the functional group into the tail, encoding how the molecule's polarizability and shape evolve with size. On the other hand, E′ captures non-specific interactions,13,20 which are important in aliphatic compounds such as PFAS, since they do not contain resonating π-electrons but lone-pair electrons.20,60 Fig. S7b shows that E′ values decreased with the increase in PFAS chain length, while log Kd increased. This inverse trend suggests that the incremental addition of CF2 units decreases the polarizability of the fluorine tail of the molecule, which decreases its interaction with aqueous phases, facilitating sorption on the solid phase. Besides these mechanistic differences, Abraham descriptors are largely derived from small, neutral organic molecules, potentially limiting their applicability across diverse PFAS chemistries. Therefore, these results indicate that MCIs are better suited than Abraham descriptors to modeling PFAS sorption for our datasets.
Kd and the nine MCIs identified as potential top individual predictors (p < 0.0001) (i.e., VP-1, VP-4, VC-3, VC-4, VC-6, VPC-4, SP-3, SP-4, and SPC-4) (Fig. S6). The SLR models with the third-order simple path, SP-3 (Rpred2 = 71.7%, Model S7), or the sixth-order valence cluster, VC-6 (Rpred2 = 71.5%, Model S5), demonstrated moderate predictive power, highlighting the importance of incorporating multiple MCIs or soil properties (Table S9). The discrepancy between the top individual predictors selected by RF in the predictor screening process (VP-4 and VP-1; Table S7) and those selected in SLR models (SP-3 and VC-6; Table S9) highlights the RF's ability to capture non-linear relationships that linear models may overlook. When individual MCIs were combined with the soil properties via MLREN, SP-3 (Rpred2 = 77.9% for Model M7 in Table S10b) yielded a predictive accuracy improvement of over 6% compared to the best-fit SLR model (Model S7 in Table S9), demonstrating the importance of the soil features. Moreover, MLREN exhibited reduced overfitting, as indicated by smaller differences between the training and validation RMSE values (0.003–0.044 in Table S10) relative to those in the SLR models (0.007–0.057 in Table S9).
The shift in the most important MCI from VP-7 in the carboxylic-PFAS dataset to SP-3 in the PFAS-full dataset is noteworthy. VP-7 performed well with the structurally similar PFCAs, likely capturing subtle electronic differences in the fluorinated chains and head groups. Conversely, SP-3 proved more effective across the diverse 19 PFAS, likely by better differentiating compounds based on the chain length, branching, and overall compactness, irrespective of the specific functional groups. This suggests that higher-order valence indices capture molecular size and electron distribution nuances, crucial for accurate sorption predictions across diverse PFAS.
More complex models incorporating multiple MCIs (Models M10 and A11 in Table S10) were developed by adding two ASP indices and one AVP index, chosen to avoid multicollinearity with SP-3, SOC, and CEC. Model M10 evaluated SP-3, ASP-1, ASP-0, and AVP-0 as predictors (Rpred2 = 83.70%; RMSE = 0.413; and VIF ≤ 12), enhancing the predictive performance by >5% over the best-fit single-MCI MLREN (Model M7 in Table S10b). The VIF values (0.8–12 for the MCIs; 0.8–0.9 for the soil properties) remained acceptable (≤12), confirming manageable multicollinearity51 (Table S10).
Equivalent six-predictor machine learning models were evaluated, and the ANN outperformed its RF (Rpred2 = 82.42%; RMSE = 0.428; and N = 175), support vector machine (Rpred2 = 82.14%; RMSE = 0.432; and N = 164), extreme gradient boosting (Rpred2 = 78.37%; RMSE = 0.475; and N = 175), and K-nearest neighbors (Rpred2 = 78.35%; RMSE = 0.475; and N = 175) counterparts. The ANN model using SP-3, ASP-1, ASP-0, AVP-0, SOC, and CEC as predictors (Model A11 in Table S10; a single hidden layer; three neurons; the hyperbolic tangent (TanH) activation function; 20 boosting iterations; Fig. S5) yielded a higher predictive accuracy (Rpred2 = 84.9%; RMSE = 0.397; and N = 164) than MLREN (Rpred2 = 83.7%; RMSE = 0.413; and N = 164). This finding suggests that the ANN model captured non-linearities and predictor interactions that the linear models missed. However, this increased complexity might not translate into better generalization. Table 1 summarizes the performance metrics of the best predictive models; the italic row (Model M7) represents our selected, most generalizable model, chosen for its balance between predictive accuracy (Rpred2) on the validation datasets and external generalization.
| Dataset | Model type | Model no. | Predictor importance (highest → lowest) | Radj2 (%), T | RMSE, T | Rpred2 (%), V | RMSE, V |
|---|---|---|---|---|---|---|---|
| PFCAs | Abraham descriptors | — | E, SOC, and CEC | 80.2 | 0.51 | 79.8 | 0.47 |
| PFCAs | MCIs descriptors | — | VP-7, SOC, and CEC | 82.9 | 0.47 | 84.8 | 0.40 |
| 3 PFAS subclasses | MCIs descriptors | M7 | SP-3, SOC, and CEC | 77.1 | 0.478 | 77.9 | 0.481 |
| 3 PFAS subclasses | MCIs descriptors | M10 | SP-3, ASP-1, ASP-0, AVP-0, SOC, and CEC | 84.4 | 0.394 | 83.7 | 0.413 |
| 3 PFAS subclasses | MCIs descriptors | A11 | SP-3, ASP-1, AVP-0, ASP-0, SOC, and CEC | 86.3* | 0.369 | 84.9 | 0.397 |

a T: training and V: validation. *R2: goodness-of-fit on training data in A11. Rpred2: goodness-of-fit on validation data and Radj2: goodness-of-fit on training data. RMSE: root mean square error.
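As a reading aid for the metrics in Table 1, the sketch below shows one common way to compute a coefficient of determination, its adjusted form, and the RMSE. It assumes the standard definitions (R² = 1 − SS_res/SS_tot, evaluated on whichever split is being scored); whether the authors' software computed Rpred2 in exactly this way is an assumption, and the function names are illustrative.

```python
import math

def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, evaluated on the supplied data split."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalize R^2 for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

def rmse(y_true, y_pred):
    """Root mean square error in the units of the response (log Kd)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# Hypothetical observed and predicted log Kd values (not study data).
observed = [0.2, 0.9, 1.4, -0.3, 0.6]
predicted = [0.3, 0.7, 1.2, -0.1, 0.8]
r2 = r_squared(observed, predicted)
print(f"R2 = {100 * r2:.1f}%, RMSE = {rmse(observed, predicted):.3f}")
print(f"Adjusted R2 (3 predictors) = {100 * adjusted_r_squared(r2, 5, 3):.1f}%")
```

Comparing the training and validation values of these metrics, as done in the text, gives a quick indication of overfitting: a large gap between the two RMSE values suggests that the model has adapted to the training split.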
To contextualize our results, the modeling framework and predictive performance of our model were compared qualitatively with a few recent PFAS sorption modeling studies.49,61,62 As summarized in Table S14, our study stands out by using MCIs as PFAS descriptors. Unlike conventional physicochemical or geometric descriptors, such as molecular weight, hydrophobicity, solubility, or molecular size, MCIs represent atomic connectivity and branching patterns within the molecular graph. This allows them to encode structural information without relying on 3D conformations or experimentally derived physicochemical parameters, making them computationally efficient and broadly applicable across diverse PFAS structures. The predictive accuracy obtained in our study (R2 = 77.9–84.9% and RMSE = 0.40–0.50) was comparable to the R2 values (72–93%) and RMSE ranges (0.36–0.86) reported in those studies, underscoring the robustness and transferability of the MCI-based modeling framework. However, direct numerical comparison across studies is not strictly appropriate, as the reported models relied on different descriptor types (e.g., molecular weight, log Kow, and charge density) and distinct machine learning frameworks (e.g., RF and LGBM). Furthermore, the underlying soil datasets differ substantially among studies (Table S15). Our dataset encompasses soils with relatively low mean SOC (1.13%) and moderate CEC (16.3 cmol kg−1) compared with prior works (SOC = 2–4.7%; CEC = 16–19 cmol kg−1), while maintaining a comparable pH range (6–7). These variations in soil composition and experimental conditions make strict side-by-side comparison difficult but nonetheless highlight the practical advantage of MCIs as versatile predictors to evaluate the PFAS sorption behavior (log Kd).
Kd (positive SHAP values), indicating that more complex structures may enhance PFAS sorption. Fig. S9 illustrates a perfect correlation between SP-3 and molecular weight across all PFAS subclasses (R2 = 1.00), confirming that SP-3 captured molecular size and chain structure, two primary drivers of PFAS sorption. This correlation also indicates that SP-3 reflects PFAS hydrophobicity. Therefore, SP-3 provides a unified metric that bridges both structural and physicochemical factors affecting sorption, explaining its superior performance in the PFAS-full dataset.
Similarly, SP-3 remained the predominant positive contributor in the SHAP plots of Models M10 and A11 (Fig. 5b and c), although both showed broader SHAP value ranges than Model M7, reflecting their increased complexity. The inclusion of ASP-1, ASP-0, and AVP-0 introduced greater variance in predictor effects. These three low-variance variables contributed disproportionately to the SHAP space, showing narrow but impactful SHAP bands and non-monotonic relationships (Fig. 5b). In other words, Model M10 was sensitive to small differences in these variables, suggesting potential overfitting to dataset-specific noise.
In Model A11, as seen in the SHAP plot (Fig. 5c), the non-linear nature of the ANN model was manifested as more complex and asymmetric distributions. SP-3 again exerted the largest influence, with its high values (red) correlating with higher log Kd. However, ASP-1, ASP-0, and AVP-0 still contributed non-trivially despite their minimal variability, indicating that the ANN captured interactions between the low-variance MCIs and dynamic soil features. At the same time, the wide SHAP fluctuations for ASP-1 and AVP-0 highlighted the model's high sensitivity to low-signal inputs and the associated risk of model instability. In contrast, SOC and CEC appeared to exert less overall impact on the model than the molecular descriptors. While low SOC values were associated with negative SHAP values (reduced sorption potential), their limited spread suggests a secondary influence relative to molecular features, aligning with previous findings.49,58 Similarly, CEC showed a relatively narrow effect range, with low CEC values contributing negatively to predictions. Overall, the model indicates that PFAS sorption was driven more by intrinsic molecular features, particularly connectivity-based descriptors like SP-3, than by soil characteristics in the current dataset.49,58 These results underscore the utility of MCIs in capturing the structural determinants of PFAS behavior in the environment.
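To make the sign convention of the SHAP plots concrete, the sketch below uses the closed-form result for a linear model with independent predictors, where the SHAP value of a predictor equals its coefficient times its deviation from the background mean. This applies only to linear models such as the MLREN ones, not to the ANN. The intercept, coefficients, and the SP-3 background mean are hypothetical placeholders, not the fitted values of Model M7; the SOC and CEC means are the dataset means quoted in the text.

```python
# Hypothetical placeholder coefficients (NOT the fitted Model M7 values).
INTERCEPT = -2.2
COEFS = {"SP-3": 0.18, "SOC": 0.11, "CEC": 0.003}
# Background means: SOC and CEC from the text; the SP-3 mean is assumed.
MEANS = {"SP-3": 8.0, "SOC": 1.13, "CEC": 16.3}

def predict(x):
    """Linear prediction of log Kd for one sample (dict of predictors)."""
    return INTERCEPT + sum(COEFS[name] * x[name] for name in COEFS)

def linear_shap(coefs, x, background_means):
    """SHAP values of a linear model: beta_i * (x_i - mean_i)."""
    return {name: coefs[name] * (x[name] - background_means[name])
            for name in coefs}

sample = {"SP-3": 10.0, "SOC": 0.5, "CEC": 20.0}
phi = linear_shap(COEFS, sample, MEANS)
base_value = predict(MEANS)  # prediction at the background means
for name, value in phi.items():
    print(f"{name}: SHAP = {value:+.3f}")
print(f"base + sum(SHAP) = {base_value + sum(phi.values()):.3f}")
print(f"model prediction = {predict(sample):.3f}")
```

In this toy setting, an above-average SP-3 value yields a positive SHAP value and a below-average SOC a negative one, which mirrors the pattern described for the SHAP plots; the base value plus the SHAP values reproduces the model prediction.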
Monte Carlo simulations (N = 5000) were used to further decompose the total importance of each predictor into main (predictor-independent) and interaction (predictor-dependent) effects on log Kd predictions within the PFAS data context. Across all three models, the general order of predictor importance was MCIs > SOC > CEC. Both the non-linear (ANN) and linear (MLREN) approaches converged on the primacy of SP-3, followed by ASP-1, but diverged on the relative importance of ASP-0 versus AVP-0 (ASP-0 > AVP-0 for Model M10; AVP-0 > ASP-0 for Model A11). The main effects accounted for 94–97% of the predictive power of Models M7, M10, and A11, whereas interaction effects contributed only 3–6%. Among all predictors, SP-3 had the largest share of the main effects, ranging from 68 to 95%, with minimal interaction contributions (1–2%). Other MCIs exhibited substantially smaller main effects, including ASP-1 (6–13%), ASP-0 (2–7%), and AVP-0 (4–6%), each with 1% interaction contributions. Soil properties contributed modestly to predictive power, with SOC accounting for 1–2% of the main effect and 0.4–1.1% of the interaction effect, while CEC explained only 0.02–0.2% of the main effect and 0.02–0.1% of the interaction effect. Overall, interaction effects were minimal across all predictors, indicating that model predictions were primarily driven by additive main effects rather than predictor interactions. The ANN's ability to capture these subtle interactions likely contributed to its higher validation Rpred2 than that of the inherently linear MLREN. However, the generalization capacity of the models remains to be tested to determine whether these captured patterns are mechanistically plausible and generalizable.

Across all the MLREN models, the rates of change in log Kd with a one-unit increase in a predictor ranged from 0.16 ± 0.004 (Model M9; SPC-4) to 123.7 ± 7.5 (Model M10; ASP-1) for the MCIs (P < 0.0001), from 0.098 ± 0.016 (Model M1) to 0.12 ± 0.012 (Model M10) for SOC (P < 0.0001), and from 0.002 ± 0.001 (Models M1, M2, and M10) to 0.004 ± 0.001 (Model M9) for CEC (P < 0.05 in only four models: M3, M5, M6, and M9) (Table S10a). The large positive slopes indicated high sensitivity of log Kd to small changes in ASP-1 (123.7), ASP-0 (122.0), AVP-0 (51.5), VC-4 (23.0), and VC-6 (22.0) (Table S10a).
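The split into main and interaction effects can be sketched with variance-based (Sobol) sensitivity indices estimated by Monte Carlo sampling. The text does not state which estimator the authors' software used, so this is a stand-in under stated assumptions: independent uniform inputs on [0, 1], the pick-freeze estimators of Saltelli (first-order) and Jansen (total effect), and a hypothetical two-input toy model. The interaction share is taken as total effect minus main effect.

```python
import random

def sobol_indices(f, dim, n=20000, seed=0):
    """Return a (main, total, interaction) tuple per input of f.

    Assumes independent U(0, 1) inputs; f takes a list of length dim.
    """
    rng = random.Random(seed)
    a_rows = [[rng.random() for _ in range(dim)] for _ in range(n)]
    b_rows = [[rng.random() for _ in range(dim)] for _ in range(n)]
    f_a = [f(row) for row in a_rows]
    f_b = [f(row) for row in b_rows]
    mean = sum(f_a + f_b) / (2 * n)
    var = sum((v - mean) ** 2 for v in f_a + f_b) / (2 * n - 1)
    results = []
    for i in range(dim):
        # Rows of A with column i swapped in from B ("pick-freeze").
        f_ab = [f(a[:i] + [b[i]] + a[i + 1:]) for a, b in zip(a_rows, b_rows)]
        main = sum((yb - mean) * (yab - ya)
                   for yb, yab, ya in zip(f_b, f_ab, f_a)) / n / var
        total = sum((ya - yab) ** 2 for ya, yab in zip(f_a, f_ab)) / (2 * n) / var
        results.append((main, total, total - main))
    return results

def toy_model(x):
    """Hypothetical additive response: a dominant input plus a weak one."""
    return 3.0 * x[0] + 1.0 * x[1]

for idx, (main, total, inter) in enumerate(sobol_indices(toy_model, dim=2)):
    print(f"input {idx}: main={main:.2f} total={total:.2f} interaction={inter:.2f}")
```

For a purely additive model such as this toy example, the interaction shares are close to zero and the main effects sum to roughly one, which is the qualitative pattern reported above (main effects of 94–97%).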
Table 2 summarizes the external predictive performance of Models M7, M10, and A11 tested on the soil-only datasets and the soil & sediments datasets. In nearly all cases, Rext2 increased when the combined soil & sediments datasets were tested (the exception being Model M10 on the 35 PFAS subset). Specifically, Model M7 (italic row) consistently achieved the highest R2 values and lowest RMSEs, particularly for the soil & sediments datasets (18 PFAS subset: R2 = 53.10%, RMSE = 0.56; and 35 PFAS subset: R2 = 52.40%, RMSE = 0.62), suggesting its robustness and applicability to more chemically diverse and environmentally complex datasets in comparison with Models M10 and A11. These findings emphasize that internal validation serves as a necessary condition for model viability, while external validation represents a sufficient condition for real-world predictability: both are indispensable for robust forecasting. Moreover, the inclusion of sediment data appears to enhance the model performance by increasing the diversity and representativeness of predictor distributions, thus better capturing the complexity of PFAS sorption behavior. This is particularly important given the limited data available on the adsorption and desorption behaviors of long-chain and emerging PFAS in sediments.63–66
| Model no. | MCI predictor and soil properties | 18 PFAS, soil-only: Rext2 (%) | 18 PFAS, soil-only: RMSE | 18 PFAS, soil and sediments: Rext2 (%) | 18 PFAS, soil and sediments: RMSE | 35 PFAS, soil-only: Rext2 (%) | 35 PFAS, soil-only: RMSE | 35 PFAS, soil and sediments: Rext2 (%) | 35 PFAS, soil and sediments: RMSE |
|---|---|---|---|---|---|---|---|---|---|
| M7 | SP-3, SOC, and CEC | 49.6 | 0.55 | 53.1 | 0.56 | 45.6 | 0.82 | 52.4 | 0.62 |
| M10 | SP-3, ASP-1, ASP-0, AVP-0, SOC, and CEC | 47.5 | 0.61 | 51.4 | 0.62 | 20.2 | 1.22 | 19.2 | 1.01 |
| A11 | SP-3, ASP-1, AVP-0, ASP-0, SOC, and CEC | 4.87 | 3.62 | 7.33 | 3.61 | 5.37 | 6.86 | 29.4 | 1.15 |

a 18 PFAS subsets: same families as those in the training and validation datasets (PFCAs, PFSAs, and FTS), except for 4 : 2 FTS. 35 PFAS subsets: representing a broader and more diverse set of chemical subclasses and emerging compounds. Rext2: goodness-of-fit on external data. RMSE: root mean square error.
The underlying cause of this performance divergence was three-fold, stemming from the interplay between the data characteristics and model complexity. First, all six predictors and log Kd showed significant distributional shifts between the 19 PFAS (training/validation) and external validation data (Table S6). Second, SOC and CEC demonstrated greater variability in the external data (35 PFAS soil & sediments) than in the 19 PFAS data, given the values of SD (CEC: 26.39 vs. 11.98; SOC: 5.25 vs. 1.36), CV (CEC: 111 vs. 74; SOC: 165 vs. 121), range (CEC: 140.00 vs. 80.00; SOC: 37.60 vs. 7.70), and IQR (CEC: 18.00 vs. 14.50; SOC: 2.80 vs. 1.15) (Table S6). All the models trained on the narrower PFAS dataset ranges struggled to accurately generalize to these broader external conditions. Third, MCIs such as ASP-0, ASP-1, and AVP-0 displayed extremely low variance across all the datasets (CV = 1–4%), limiting their discriminatory power. The linear or non-linear exploitation of their non-generalizable fluctuations by Models M10 and A11, respectively, likely inflated their predictive accuracy but reduced their generalization capacity. In contrast, Model M7 avoided these pitfalls by focusing on SOC, CEC, and SP-3, which have substantial variance and a stable influence, as also shown by the stable predictive accuracy of the similarly parsimonious three-predictor Models M3–M9 (Rpred2 = 76.78–77.91%; RMSE = 0.481–0.493 in Tables 1 and S10b). Thus, the predictive gains obtained by adding multiple low-variance MCIs to Models M10 and A11 (despite meeting the VIF criteria) did not translate into better generalization, highlighting the trade-off between model complexity and real-world reliability.
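The spread statistics used above to compare the training/validation and external datasets (SD, CV, range, and IQR) can be computed as in the following sketch. The two samples are hypothetical CEC values, not the study data, and the quartile convention (the default of `statistics.quantiles`) may differ from the one used by the authors' software.

```python
import statistics

def spread_summary(values):
    """Return SD, CV (%), range, and IQR for a sample."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "sd": sd,
        "cv_percent": 100.0 * sd / mean,
        "range": max(values) - min(values),
        "iqr": q3 - q1,
    }

# Hypothetical CEC samples in cmol/kg (not study data).
training_cec = [4.0, 9.5, 12.0, 16.0, 21.0, 30.0]
external_cec = [1.0, 8.0, 15.0, 27.0, 60.0, 140.0]
for label, sample in (("training", training_cec), ("external", external_cec)):
    stats = spread_summary(sample)
    print(label, {key: round(val, 2) for key, val in stats.items()})
```

As a rough consistency check on the reported values, the CV follows from the SD and the mean: with the training CEC statistics quoted in the text (SD = 11.98 and mean = 16.3 cmol kg−1), 100 × 11.98/16.3 is about 73–74%, in line with the reported CV of 74.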
It is important to mention that our study and a recent publication61 used independent external datasets for evaluating the external generalization of the models, resulting in Rext2 values of 52.4% and 71.7%, respectively. In our case, prediction bias within the external dataset was primarily associated with samples exhibiting higher mean and median log Kd values, indicative of stronger PFAS adsorption. This pattern likely reflects greater variability and higher mean SOC, suggesting soils richer in organic matter compared to the training and validation datasets. Similarly, CEC values in the external dataset displayed higher averages and extreme ranges, pointing to more diverse soil mineralogy and an abundance of charge sites (Table S6). Overall, the external dataset was more heterogeneous and encompassed more complex soil conditions, with elevated SOC and CEC correlating positively with higher log Kd. These findings indicate that our model performs more accurately for soils containing low- to medium-sorbing PFAS, while prediction becomes more challenging under conditions of high organic carbon and cation exchange capacity.
Fig. 6 presents the regression equation of the model with the highest external predictive performance (Model M7) across both the soil-only and soil & sediments datasets (Table 2). The coefficients illustrate the combined roles of the molecular connectivity index (SP-3), organic matter content (SOC), and cation exchange capacity (CEC) in controlling PFAS partitioning. SP-3, the dominant predictor, captures the effects of molecular connectivity, chain length, and hydrophobic surface area, which are structural attributes known to enhance sorption. SOC exerts a moderate positive effect, consistent with the established role of organic matter in promoting hydrophobic partitioning of PFAS.32 Although the CEC coefficient is comparatively small, it remains statistically significant and improves model stability and generalization by representing soil cation-exchange and ion-bridging mechanisms.32 Collectively, these predictors describe complementary molecular and environmental controls governing the PFAS sorption behavior.
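For orientation, a three-predictor linear model of this kind has the general form sketched below. The symbols β0 (intercept) and the three slopes are generic placeholders introduced here for illustration; the fitted values are those reported in Fig. 6 and are not reproduced in this sketch.

```latex
\log K_d \;=\; \beta_0
  \;+\; \beta_{\mathrm{SP3}}\,(\mathrm{SP\text{-}3})
  \;+\; \beta_{\mathrm{SOC}}\,\mathrm{SOC}
  \;+\; \beta_{\mathrm{CEC}}\,\mathrm{CEC}
```

Here SOC is expressed in % and CEC in cmol kg−1, matching the units used throughout the text, so each slope is the change in log Kd per unit increase in the corresponding predictor with the other two held fixed.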
Kd (corresponding to minimal PFAS mobility in environmental matrices such as soils and sediments) across all the 11 models via a composite desirability function (D), (ii) model uncertainty, and (iii) model robustness or stability under boundary minima via Monte Carlo simulations. Maximal PFAS sorption yielded a consistent positive correlation of log Kd with both the soil and MCI predictors (Table S11). However, unlike the three-predictor models (e.g., Model M7), the six-predictor models (e.g., Models M10 and A11) produced extrapolated log Kd predictions beyond their observed ranges, verifying their poor generalization. Although the single-MCI MLREN models yielded slightly lower maximum desirability scores (0.83–0.89) than the multiple-MCI models (0.99), their superior generalization makes them more suitable for realistic environmental applications aimed at limiting PFAS transport. Notably, the soil conditions that yielded the highest predicted PFAS retention (SOC = 7.7% and CEC = 80 cmol kg−1) exceed typical background soil ranges (SOC < 5% and CEC < 30 cmol kg−1),67 highlighting the gap between optimal model predictions and practical field conditions.

To propagate predictor uncertainty through Model M7, Monte Carlo simulations (N = 5000) were performed by resampling SOC, CEC, and SP-3 from exponential (σ = 1.119), gamma (shape = 1.66; scale = 9.51), and Johnson Sb (θ = 1.20; δ = 1.50; scale = 20.3) distributions, respectively (Table S12). These best-fit distributions were selected based on their fit to the PFAS (training/validation) dataset. Both the Model M7 predictions (N = 638) and the simulations (N = 5000) exhibited a three-component normal mixture distribution, reflecting the inherent complexity of PFAS sorption across the diverse environmental regimes. The simulations produced slightly higher mean (0.56 vs. 0.52) and median (0.56 vs. 0.50) log Kd values, with reduced dispersion (SD: 0.81 vs. 0.88; IQR: 1.13 vs. 1.23), than the deterministic predictions of Model M7 (Table S12).
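The uncertainty propagation described above can be sketched as follows. The exponential scale, the gamma shape and scale, and the Johnson Sb θ, δ, and scale are the values quoted in the text; the Johnson Sb shape parameter γ is not given there and is set to 0 purely as a placeholder, and the regression coefficients are hypothetical stand-ins for the fitted Model M7 equation in Fig. 6. The output therefore illustrates the procedure and will not reproduce Table S12.

```python
import math
import random
import statistics

# Hypothetical placeholder coefficients (NOT the fitted Model M7 values).
INTERCEPT, B_SP3, B_SOC, B_CEC = -2.2, 0.18, 0.11, 0.003

def sample_johnson_sb(rng, gamma, delta, theta, sigma):
    """Draw from a Johnson SB distribution bounded on (theta, theta + sigma)."""
    z = rng.gauss(0.0, 1.0)
    return theta + sigma / (1.0 + math.exp(-(z - gamma) / delta))

def simulate_log_kd(n=5000, seed=1):
    """Propagate predictor uncertainty through a linear log Kd model."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n):
        soc = rng.expovariate(1.0 / 1.119)   # exponential with mean 1.119 (%)
        cec = rng.gammavariate(1.66, 9.51)   # gamma(shape, scale), cmol/kg
        sp3 = sample_johnson_sb(rng, gamma=0.0,  # gamma: assumed placeholder
                                delta=1.50, theta=1.20, sigma=20.3)
        sims.append(INTERCEPT + B_SP3 * sp3 + B_SOC * soc + B_CEC * cec)
    return sims

sims = simulate_log_kd()
print(f"mean = {statistics.fmean(sims):.2f}, SD = {statistics.stdev(sims):.2f}")
print(f"range = ({min(sims):.2f}, {max(sims):.2f})")
```

Because the predictors are drawn from fitted distributions rather than taken from the observed samples, the simulated log Kd values can extend beyond the range of the deterministic predictions, which is the behavior examined in the tail analysis.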
The simulations also exhibited an expanded prediction range (−1.65, 2.91) relative to that of Model M7 (−1.28, 2.56), revealing the underestimation of PFAS mobility/retention extremes by the model. These extremes are far more critical for risk management than central estimates. Therefore, there is a need for closer scrutiny of the three-component normal mixture distribution to account for high-impact tail behaviors. The low-sorption component (indicative of high PFAS mobility) on average shifted from −0.85 (95% CI: −0.91, −0.79) to −0.28 (95% CI: −0.31, −0.26), with variability doubling (σ: 0.26 → 0.50) and proportion tripling (π: 0.11 → 0.31). This suggests that PFAS mobility risks are higher under real-world conditions than predicted by the central tendencies of Model M7.68 Meanwhile, the mid-sorption component (indicative of moderate PFAS sorption) moved from 0.54 (95% CI: 0.48, 0.60) to 0.69 (95% CI: 0.67, 0.70), with tighter dispersion (σ: 0.70 → 0.46) and lower dominance (π: 0.79 → 0.51). The high-sorption component (indicative of PFAS retention in soil) decreased from 1.80 (95% CI: 1.75, 1.85) to 1.63 (95% CI: 1.60, 1.66), with increased spread (σ: 0.21 → 0.42) and higher probability (π: 0.11 → 0.18). In particular, the combined probability of extreme (low-sorption + high-sorption) outcomes more than doubled, from 21.5% to 49.2%, in the simulation results (Table S12). In other words, the simulated probabilities of the extreme components exceeded those of the deterministic model (Model M7) by approximately 184% (low-sorption) and 71% (high-sorption). These shifts underscore how deterministic models may significantly underestimate both the frequency and variability of extreme PFAS behaviors.
This is a critical consideration for risk management and remediation prioritization, as underestimating low-sorption (high PFAS mobility) scenarios can cause inadequate containment strategies and greater environmental exposure risk, while overestimating high-sorption may lead to insufficient remediation due to inaccurate estimations. The incorporation of predictor uncertainty enables more balanced and informed decision-making, particularly when planning safeguards for high-impact, low-frequency outcomes.
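The reported mixture components can be cross-checked against the reported overall means and standard deviations using the standard moment formulas for a finite normal mixture (mean = Σ π·μ; variance = Σ π·(σ² + μ²) − mean²). The sketch below uses the component values quoted in the text; the weights are renormalized because the rounded deterministic proportions sum to 1.01. The function name is illustrative.

```python
import math

def mixture_mean_sd(components):
    """Mean and SD of a normal mixture given (weight, mu, sigma) tuples."""
    total_weight = sum(w for w, _, _ in components)
    weights = [w / total_weight for w, _, _ in components]  # renormalize
    mean = sum(w * mu for w, (_, mu, _) in zip(weights, components))
    second_moment = sum(w * (sigma ** 2 + mu ** 2)
                        for w, (_, mu, sigma) in zip(weights, components))
    return mean, math.sqrt(second_moment - mean ** 2)

# (proportion, mean, SD) of the low-, mid-, and high-sorption components,
# as quoted in the text for Model M7 and for the Monte Carlo simulations.
deterministic = [(0.11, -0.85, 0.26), (0.79, 0.54, 0.70), (0.11, 1.80, 0.21)]
simulated = [(0.31, -0.28, 0.50), (0.51, 0.69, 0.46), (0.18, 1.63, 0.42)]

for label, comps in (("Model M7", deterministic), ("simulations", simulated)):
    mean, sd = mixture_mean_sd(comps)
    print(f"{label}: mean = {mean:.2f}, SD = {sd:.2f}")
```

The resulting moments land close to the summary statistics reported earlier for the two distributions (means of 0.52 and 0.56; SDs of 0.88 and 0.81), which suggests that the three-component description is internally consistent with those summaries, up to rounding of the published parameters.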
Finally, boundary-minima robustness (stability) evaluates how well Model M7 performs under extreme, low-sorption conditions (representing a worst-case scenario for PFAS mobility) by fixing the predictors at their minimum observed values (SOC: 0.1%; CEC: 0.5 cmol kg−1; and SP-3: 4.9) and adding Gaussian noise (SD = 0.88) to log Kd across 5000 simulations. The simulated mean (−1.30; 95% CI: −1.42, −1.19) closely matched the model's predicted minimum (−1.30; 95% CI: −1.33, −1.28), confirming unbiased predictions under extreme low-sorption conditions. The simulated SD (0.89) closely matched the noise magnitude (SD = 0.88), demonstrating linear and predictable error propagation and model stability at this operational boundary. The narrow 95% CI (0.05) further underscored the model's precision at this operating point. In other words, the model intercept and slopes gave a reliable estimate at this boundary, even under the noisy log Kd measurement.
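The boundary-minima check can be sketched as a simple noise-injection experiment. The predicted minimum (−1.30), the noise SD (0.88), and the number of simulations (5000) are the values quoted in the text; the random seed, the function name, and the normal-approximation confidence interval for the mean are assumptions made for this illustration.

```python
import math
import random
import statistics

def boundary_noise_check(pred_min=-1.30, noise_sd=0.88, n=5000, seed=0):
    """Add Gaussian noise to the boundary prediction and summarize it."""
    rng = random.Random(seed)
    sims = [pred_min + rng.gauss(0.0, noise_sd) for _ in range(n)]
    mean = statistics.fmean(sims)
    sd = statistics.stdev(sims)
    half_width = 1.96 * sd / math.sqrt(n)  # normal-approximation 95% CI
    return mean, sd, (mean - half_width, mean + half_width)

mean, sd, (ci_low, ci_high) = boundary_noise_check()
print(f"simulated mean = {mean:.2f}, SD = {sd:.2f}")
print(f"95% CI of the mean: width = {ci_high - ci_low:.3f}")
```

With 5000 draws and a noise SD of 0.88, the expected width of the 95% confidence interval of the mean is about 2 × 1.96 × 0.88/√5000 ≈ 0.05, which is consistent with the narrow interval mentioned in the text. A simulated mean and SD that stay close to the inputs indicate that the noise propagates linearly, without introducing bias at this boundary.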
In summary, these scenario analyses reaffirm Model M7's stability for the central and boundary predictions while emphasizing the need for probabilistic outputs in actionable risk assessments, given its deterministic nature. To capture the full range of plausible outcomes, future studies should explore dynamic modeling to incorporate spatiotemporal variability in soil and PFAS properties. Embedding such dynamics into decision-support tools can help balance remediation costs with ecological and public health risks.
Kd across chemically diverse PFAS. Compared with prior QSAR and machine learning models, the proposed framework achieves comparable generalization while using a more parsimonious and interpretable predictor set. This parsimony enhances model transparency and practical usability in data-limited scenarios where detailed molecular descriptors or physicochemical measurements are unavailable. In particular, the most influential descriptor, SP-3, effectively captured dominant molecular determinants of sorption, accounting for 68–95% of main-effect variance through structural connectivity and hydrophobicity. The inclusion of soil organic carbon (SOC) and cation exchange capacity (CEC) further improved external stability and mechanistic realism, representing hydrophobic partitioning and cation-bridging pathways, respectively. Although CEC exhibited a smaller numerical coefficient, it remained statistically significant and improved the generalization performance, reinforcing its mechanistic relevance for anionic PFAS.
While the three-predictor MLREN model (Model M7) exhibited moderate external generalization (Rext2 = 52.4%) relative to internal validation (Rpred2 = 77.9%), its balanced combination of interpretability, simplicity, and robustness makes it a practical screening-level framework for rapid PFAS sorption assessment. The ANN models achieved higher accuracy during training/validation by capturing non-linear interactions but displayed reduced external reliability, highlighting the trade-off between model complexity and generalizability. Model predictions should therefore be interpreted within the context of uncertainty and chemical domain coverage, particularly for underrepresented PFAS subclasses.
Future research should focus on improving model generalizability and applicability by expanding the PFAS and soil datasets and by incorporating additional physicochemical descriptors, soil composition variables, and spatiotemporal dynamics. These extensions would better reflect the complexity of PFAS behavior and strengthen the model's robustness and potential applications in environmental risk assessment.
| AICc | Corrected Akaike information criterion |
| ANN | Artificial neural network |
| C# | Number of carbon atoms in the molecule (e.g., C4 or C8) |
| CEC | Cation exchange capacity |
| FTS | Fluorotelomer sulfonates |
| KNN | K-nearest neighbor |
| Koa | Octanol–air partitioning coefficient |
| Kow | Octanol–water partitioning coefficient |
| Log Kd | Soil–water partitioning coefficient |
| LSER | Linear solvation energy relationship |
| MCIs | Molecular connectivity indices |
| ML | Machine learning |
| N | Sample size |
| PFAS | Per- and poly-fluoroalkyl substances |
| PFCAs | Perfluorocarboxylic acids |
| PFSAs | Perfluorosulfonic acids |
| QSPR | Quantitative structure–property relationship |
| QSAR | Quantitative structure–activity relationship |
| Radj2 | Adjusted coefficient of determination on training dataset |
| Rpred2 | Predicted coefficient of determination on validation dataset |
| Rext2 | Predicted coefficient of determination on independent external dataset |
| RF | Random forest |
| SHAP | Shapley additive explanations |
| SMILES | Simplified molecular-input line-entry system |
| SOC | Soil organic carbon |
| SP-3 | Simple-path order 3 |
| SVM | Support vector machine |
| VIF | Variance inflation factor |
| VP-7 | Valence-path order 7 |
| XGB | Extreme gradient boosting |

This journal is © The Royal Society of Chemistry 2026