Machine-learning prediction of DCAA and TCAA concentrations in drinking water
Abstract
Disinfection by-products (DBPs) in drinking water are of significant concern because of their carcinogenic, teratogenic, and mutagenic properties, which makes real-time monitoring essential for ensuring water safety. However, DBPs typically occur at low concentrations, and conventional detection methods are costly and complex, so researchers have increasingly turned to predictive models based on easily measurable water quality parameters. This study systematically evaluates the feasibility of machine learning (ML) methods for predicting the concentrations of dichloroacetic acid (DCAA) and trichloroacetic acid (TCAA). Multiple linear regression (MLR), although computationally efficient, is limited by its linear assumptions and showed poor predictive performance (test set N25 = 23–54%, R² = 0.353–0.640). Support vector regression (SVR), which relies on kernel functions, provided only marginal improvement (N25 = 46–69%, R² = 0.442–0.595). The backpropagation neural network (BPNN) substantially improved prediction accuracy through flexible configuration of the hidden layer structure, number of nodes, and activation functions: with one hidden layer of 15 nodes, BPNN outperformed both MLR and SVR for DCAA and TCAA (test set N25 = 89%, R² = 0.850). Nevertheless, BPNN has inherent limitations, such as slow convergence due to a fixed learning rate and a tendency to converge to local optima as a result of random initialization. To address these issues, this study introduced particle swarm optimization (PSO) to globally optimize the BPNN weights, which further increased the prediction accuracy to over 98%. The results demonstrate that high-precision prediction can be achieved using only eight conventional water quality parameters, offering an economical, convenient, and reliable technical approach for monitoring DBPs in water supply systems.
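
To make the PSO-optimized network concrete, the following is a minimal sketch in Python with NumPy and is not the implementation used in this study. It assumes a network with eight inputs, one hidden layer of 15 tanh nodes, and a single output, and it lets PSO search the flattened weight vector directly, without a subsequent backpropagation step. The swarm size, inertia weight, acceleration coefficients, and the synthetic data are all illustrative assumptions. N25 is taken here to mean the percentage of predictions that fall within ±25% of the measured value, which is also an assumption about the metric's definition.

```python
import numpy as np

N_IN, N_HID = 8, 15                      # eight inputs, one hidden layer of 15 nodes
DIM = N_IN * N_HID + N_HID + N_HID + 1   # all weights and biases, flattened

def forward(w, X):
    """Evaluate the network whose parameters are packed in the flat vector w."""
    i = N_IN * N_HID
    W1 = w[:i].reshape(N_IN, N_HID)      # input-to-hidden weights
    b1 = w[i:i + N_HID]                  # hidden biases
    W2 = w[i + N_HID:i + 2 * N_HID]      # hidden-to-output weights
    b2 = w[-1]                           # output bias
    return np.tanh(X @ W1 + b1) @ W2 + b2

def mse(w, X, y):
    return float(np.mean((forward(w, X) - y) ** 2))

def pso(X, y, n_particles=30, n_iter=200, inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO over the weight vector (hyperparameters are assumptions)."""
    rng = np.random.default_rng(seed)
    pos = rng.normal(0.0, 0.5, (n_particles, DIM))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_cost = np.array([mse(p, X, y) for p in pos])
    gbest = pbest[np.argmin(pbest_cost)].copy()
    history = [float(pbest_cost.min())]
    for _ in range(n_iter):
        r1, r2_ = rng.random(pos.shape), rng.random(pos.shape)
        # Velocity blends inertia, pull toward personal best, pull toward global best.
        vel = inertia * vel + c1 * r1 * (pbest - pos) + c2 * r2_ * (gbest - pos)
        pos = pos + vel
        cost = np.array([mse(p, X, y) for p in pos])
        better = cost < pbest_cost
        pbest[better] = pos[better]
        pbest_cost[better] = cost[better]
        gbest = pbest[np.argmin(pbest_cost)].copy()
        history.append(float(pbest_cost.min()))
    return gbest, history

def n25(y_true, y_pred):
    """Percentage of predictions within +/-25% of the measured value (assumed definition)."""
    return float(100.0 * np.mean(np.abs(y_pred - y_true) <= 0.25 * np.abs(y_true)))

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic stand-in data: NOT the study's measurements.
rng = np.random.default_rng(1)
X = rng.random((120, N_IN))
y = 20.0 + 10.0 * X[:, 0] * X[:, 1] + 5.0 * np.sin(3.0 * X[:, 2])
X_train, X_test, y_train, y_test = X[:90], X[90:], y[:90], y[90:]

# Standardize the target so the search operates on a unit scale.
y_mean, y_std = y_train.mean(), y_train.std()
best_w, history = pso(X_train, (y_train - y_mean) / y_std)
y_pred = forward(best_w, X_test) * y_std + y_mean

print(f"training MSE: {history[0]:.3f} -> {history[-1]:.3f}")
print(f"test N25 = {n25(y_test, y_pred):.0f}%, R2 = {r2(y_test, y_pred):.3f}")
```

Because each particle keeps its best position so far, the best training error recorded in `history` cannot increase from one iteration to the next. A hybrid scheme in which PSO supplies the initial weights and backpropagation then refines them is a common alternative; the sketch omits that step for brevity.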