Machine learning prediction of DOC–water partitioning coefficients for organic pollutants from diverse DOM origins†
Abstract
This study aims to improve predictions and understanding of dissolved organic carbon–water partitioning coefficients (KDOC), a crucial parameter in environmental risk assessment. A dataset encompassing 709 datapoints across 190 unique organic pollutants and various types of dissolved organic matter (DOM) was compiled. Molecular descriptors were calculated to characterize each compound's properties and structures using Multiwfn, PaDEL and RDKit. Individual machine learning models were established for four different DOM origins: all DOM, natural aquatic DOM, natural terrestrial DOM and commercial DOM. These models exhibited excellent goodness-of-fit, internal stability, and predictive performance with Rtrain2 > 0.771, Rvalid2 > 0.602, Rtest2 > 0.629, and RMSEtest ranging from 0.413 to 0.580. Shapley additive explanation analysis identified CrippenLogP and MATS2m as the most influencing factors. CrippenLogP, reflecting hydrophobicity, positively influenced KDOC, while MATS2m, characterizing molecular branching and compactness, had a negative effect. Mor29m, where lower values indicate a higher abundance of heteroatoms such as halogens, also showed a negative impact, likely due to enhanced interactions with polar DOM groups. SlogP_VSA1, another descriptor related to hydrophobicity, demonstrated a positive correlation with log KDOC in natural aquatic DOM, while its negative correlation in all DOM may reflect the great diversity of DOM properties in that group. Partial dependence plots revealed that when CrippenLogP > 6, Mor29m between 0.45 and 0.52, MATS2m < −0.015, and SlogP_VSA1 < 7, organic pollutants tended to partition more into DOM. These findings support the application of machine learning models for assessing pollutant interactions with DOM, contributing to improved environmental risk predictions.