Characterizing Chemical Toxicity for Life Cycle Assessment Using Machine Learning and Deep Learning Models Based on Environmental Footprint – Methodological Comparison and Textile Case Study
Abstract
The rapid expansion of registered chemicals, coupled with persistent data gaps, poses a major challenge for toxicity assessment in Life Cycle Assessment (LCA) and Safe and Sustainable by Design (SSbD). This study proposes a data-driven framework that predicts toxicity characterization factors (CFs) directly from Simplified Molecular Input Line Entry System (SMILES) strings, using the Environmental Footprint (EF) v3.1 database as the training benchmark. We systematically evaluate five machine-learning (ML) and deep-learning (DL) approaches: random forest, XGBoost, Gaussian process, deep neural networks (DNN), and graph neural networks (GNN) implemented as message-passing neural networks (MPNN). These are compared across three molecular representations: handcrafted physicochemical descriptors (Mordred), molecular graphs, and large-scale pretrained molecular embeddings (GROVER). Predictive performance is strongly target-dependent, with ecotoxicity CFs showing consistently higher predictability (R² = 0.57–0.67) than human toxicity CFs (R² = 0.40–0.60). Mordred-based ML models, particularly XGBoost, perform robustly across multiple targets, while GNN models, especially multi-target MPNNs trained on graph-only representations, achieve comparable or, for several ecotoxicity targets, superior performance. GROVER embeddings reach competitive performance primarily when coupled with DL architectures. These results demonstrate that graph-based and pretrained molecular representations can effectively capture complex structure–toxicity relationships, reducing reliance on manual feature engineering. The framework further integrates applicability domain analysis and chemical clustering to enable domain-consistent, optimized model selection.
A textile-sector case study illustrates how predicted CFs for chemicals that previously lacked them can be incorporated into LCA. It shows that omitting chemicals because their CFs are missing can lead to substantial underestimation of toxicity impacts, by up to an order of magnitude in the examined case.
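The underestimation effect can be illustrated with a minimal sketch, assuming the usual LCA aggregation in which a toxicity impact score is the sum of each emitted mass multiplied by its CF. The chemical names, masses, and CF values below are hypothetical placeholders chosen for illustration, not data or results from the study, and the helper `toxicity_impact` is likewise an assumed name.

```python
from typing import Dict, Optional


def toxicity_impact(inventory: Dict[str, float],
                    cfs: Dict[str, Optional[float]]) -> float:
    """Sum emitted mass (kg) times CF (impact units per kg).

    Chemicals with no CF (None or absent) contribute nothing, which is
    what happens when a missing CF is silently excluded from an LCA.
    """
    total = 0.0
    for chemical, mass_kg in inventory.items():
        cf = cfs.get(chemical)
        if cf is not None:
            total += mass_kg * cf
    return total


# Hypothetical emission inventory (kg per functional unit).
inventory = {"dye_A": 2.0, "softener_B": 1.0, "finish_C": 0.5}

# Hypothetical database CFs; None marks a chemical without a CF.
database_cfs = {"dye_A": 100.0, "softener_B": None, "finish_C": None}

# Hypothetical model-predicted CFs for the chemicals lacking a database value.
predicted_cfs = {"softener_B": 1200.0, "finish_C": 1200.0}

# Database values take precedence; predictions only fill the gaps.
filled_cfs = {
    chem: (cf if cf is not None else predicted_cfs.get(chem))
    for chem, cf in database_cfs.items()
}

impact_without = toxicity_impact(inventory, database_cfs)
impact_with = toxicity_impact(inventory, filled_cfs)
ratio = impact_with / impact_without

print(f"Impact excluding missing CFs: {impact_without}")
print(f"Impact with predicted CFs:    {impact_with}")
print(f"Ratio: {ratio}")
```

With these placeholder numbers, the gap-filled score is ten times the score obtained when the two chemicals without CFs are dropped. In practice the size of the gap depends entirely on the inventory and on the predicted values.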