Open Access Article
Ece Yıldız,^a Mustafa Şen^a and Mehmet Akif Özdemir*^ab
^a Department of Biomedical Engineering, Izmir Katip Celebi University, Izmir, Turkiye. E-mail: makif.ozdemir@ikcu.edu.tr
^b Biomedical Research in AI & Neuroscience Laboratory, Izmir Katip Celebi University, Izmir, Turkiye
First published on 8th June 2026
This study introduces a machine learning (ML)-enhanced smartphone application designed for the precise colorimetric quantification of pH strips. To ensure the robustness of the system against environmental variations, a comprehensive dataset was constructed by capturing images of pH strips under diverse illumination conditions and camera angles. Following region of interest extraction, an initial set of 33 colorimetric features was employed to train and evaluate 15 different regression models. To ensure model interpretability and computational efficiency, a SHapley Additive exPlanations (SHAP)-based analysis was implemented, successfully identifying six critical descriptors (including color channel skewness, entropy, and intensity metrics) that primarily govern the pH prediction. The best-performing model (R2 = 0.99) was subsequently integrated into a user-friendly Android application, pHScoper. This application enables image capture, interactive cropping, and offline, on-device quantitative analysis without cloud reliance. Overall, the developed platform demonstrates strong potential for reliable, low-cost pH measurements in resource-limited settings.
To enable this digital quantification, smartphones have emerged as an ideal platform, owing to their advanced computational capabilities and embedded high-resolution cameras as optical sensors.9 This technological convergence has facilitated the integration of colorimetric detection into portable analytical systems across diverse biological, chemical, and healthcare applications.10–14 In practice, smartphone-based systems process captured images to extract specific quantitative descriptors, primarily utilizing color spaces such as RGB (Red, Green, Blue) and HSV (Hue, Saturation, Value).15 By correlating these image-derived features with analyte concentrations, robust calibration models can be established, effectively translating complex chromatic responses into precise quantitative measurements.16,17
Unlike conventional rule-based or simple analytical methods, which struggle to model the highly non-linear and non-monotonic colorimetric responses of multi-pad strips,18 artificial intelligence (AI) architectures are inherently equipped to map these complex feature interactions, enabling accurate, continuous pH interpolation between discrete integer values. In scenarios requiring fast, portable, and efficient analysis, these data-driven approaches play a pivotal role in enhancing colorimetric detection. Consequently, smartphone-based colorimetric methods employing such algorithms have gained popularity due to their affordability, adaptability, and portability.19–21 AI-driven colorimetric detection has been successfully applied across environmental, healthcare, and food safety fields. For instance, Mutlu et al.19 used smartphone images of pH strips as a training set for a least-squares support vector machine (LS-SVM) classifier to evaluate illumination effects. Mercan et al.22 introduced GlucoSensing, a machine learning (ML)-based portable µPAD platform for glucose quantification using a smartphone. Similarly, Feng et al.23 developed a nanosensor using convolutional neural networks for glucose detection, and Liu et al.24 applied ML to microfluidic paper strips to detect salivary uric acid. These AI-enhanced colorimetric strategies enable low-volume, on-chip analysis, transforming subjective, semi-quantitative assays into reliable quantitative methods. Nevertheless, most AI models operate as “black boxes”, where only the final predictions are observable, making it challenging to quantify the information embedded in the inputs.25–27 To address this limitation in scientific and medical contexts, explainable AI (XAI) methods such as SHapley Additive exPlanations (SHAP)28 are increasingly applied to interpret model outputs and identify the most important features.
In the context of colorimetric analysis, feature-attribution techniques are particularly relevant, as they quantify and visualize the contribution of image-derived features to model predictions.29 By ranking and aggregating Shapley values, SHAP-based interpretation may facilitate evaluation of whether predictive performance is driven by physically meaningful colorimetric descriptors rather than by illumination variations or background interference.
The strategic integration of explainable AI into smartphone-based colorimetric systems is still largely unexplored, despite the acknowledged necessity for interpretability. Alongside transparency, feature importance assessment is a crucial tool for model optimization in this field. SHAP enables the identification and elimination of redundant or noise-sensitive variables by measuring the precise contribution of particular colorimetric descriptors. This targeted feature selection is highly advantageous for smartphone platforms, as it facilitates the development of lightweight ML models for efficient, on-device computation without the need for high computational power. Furthermore, isolating physically meaningful color features ensures robust and consistent analytical performance across varying environmental and illumination conditions. Addressing these gaps, this study introduces pHScoper, a portable and user-friendly Android application that combines SHAP-based feature optimization with ML-driven colorimetric analysis for quantitative pH determination, providing a reliable point-of-care tool for resource-limited settings.
To better reflect real-world operational environments, strict hardware-based color calibration and lux measurements were intentionally omitted during image acquisition. Instead, experimental variability was systematically introduced to evaluate computational robustness. As shown in Fig. 2, images were captured under diverse, uncalibrated indoor and outdoor conditions. For indoor setups, laboratory fluorescent bulbs (Philips 12 W) provided neutral-to-cool illumination at approximately 4000 K. Outdoor images were acquired under partly cloudy daylight (∼5000–6500 K) to represent diffuse natural lighting. To simulate practical scenarios involving multiple light sources, dual-illumination configurations were also evaluated by combining these ambient sources with the smartphone's integrated LED flashlight (∼5000–6500 K).20 Across these four illumination conditions, five camera angles (center, left, right, leftmost, and rightmost) were employed at 15 discrete pH values, resulting in 300 distinct image capture configurations (4 × 5 × 15). The distance between the smartphone camera and the pH strip was kept constant to ensure scale consistency. Collecting a diverse, uncalibrated dataset in this way allows the proposed framework to compensate computationally for complex illumination variations without requiring external calibration targets from the end-user. In addition, to assess device-dependent variability, an independent image set covering the full pH range (0–14; examples are presented in SI Fig. S1) was acquired using a second smartphone (Huawei Mate 10 Lite) under varying illumination conditions and camera angles.
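The count of 300 capture configurations follows from crossing the illumination conditions, camera angles, and pH values. The short sketch below enumerates that grid; the illumination labels are illustrative names chosen here (the text describes indoor, outdoor, and flash-assisted variants but does not name them this way), and integer pH values 0 to 14 are assumed.

```python
from itertools import product

# Illustrative labels (assumption): two ambient sources, each with and without flash.
ILLUMINATIONS = ["indoor", "indoor+flash", "outdoor", "outdoor+flash"]
ANGLES = ["center", "left", "right", "leftmost", "rightmost"]
PH_VALUES = list(range(15))  # assumed integer pH values 0..14

# Every (illumination, angle, pH) combination is one capture configuration.
configurations = list(product(ILLUMINATIONS, ANGLES, PH_VALUES))
print(len(configurations))  # 4 * 5 * 15 = 300
```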
Fig. 2 illustrates the images of pH strips acquired under four illumination conditions across different pH levels. To quantitatively analyze the variations of color associated with pH, RGB values were converted into HSV and CIELAB color spaces. To ensure a comprehensive analysis, features were extracted from these multiple color spaces. Statistical parameters, including mean, skewness, and kurtosis, were included to characterize the distribution properties and asymmetry of color intensity values within the multi-pad sensing region. Additionally, texture- and intensity-based features (such as contrast, correlation, energy, entropy, homogeneity, and mean intensity) were incorporated to capture spatial heterogeneity and inter-pad transitions. A total of 33 complementary descriptors were chosen due to their proven physical relevance in image-based chemical sensing applications.18 This initial comprehensive feature set served as a base for the subsequent feature importance analysis, which was utilized to identify the most critical descriptors and eliminate redundant variables.
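As a rough illustration of the descriptor types described above, the following sketch computes per-channel mean, skewness, and kurtosis for the RGB and HSV channels, plus histogram entropy and mean intensity, using only the Python standard library. It is not the authors' pipeline: the CIELAB conversion and the co-occurrence texture features (contrast, correlation, energy, homogeneity) are omitted, and the function names, the grayscale weights, and the use of non-excess kurtosis are assumptions made for this example.

```python
import colorsys
import math
from collections import Counter

def moments(values):
    """Return mean, skewness, and (non-excess) kurtosis of a sequence."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:  # constant channel: higher moments are undefined, report 0
        return mean, 0.0, 0.0
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return mean, m3 / m2 ** 1.5, m4 / m2 ** 2

def shannon_entropy(gray_levels):
    """Shannon entropy in bits of the intensity histogram."""
    counts = Counter(gray_levels)
    n = len(gray_levels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def roi_features(pixels):
    """pixels: list of 8-bit (R, G, B) tuples taken from the cropped ROI."""
    channels = {
        "R": [p[0] for p in pixels],
        "G": [p[1] for p in pixels],
        "B": [p[2] for p in pixels],
    }
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
    for i, name in enumerate("HSV"):
        channels[name] = [c[i] for c in hsv]

    features = {}
    for name, values in channels.items():
        mean, skew, kurt = moments(values)
        features[f"{name}-mean"] = mean
        features[f"{name}-skewness"] = skew
        features[f"{name}-kurtosis"] = kurt

    # Assumed luma weights (ITU-R BT.601) for the grayscale intensity image.
    gray = [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in pixels]
    features["entropy"] = shannon_entropy(gray)
    features["mean-intensity"] = sum(gray) / len(gray)
    return features
```

Calling `roi_features` on the pixel list of a cropped strip returns a dictionary keyed by names such as `G-skewness` or `V-kurtosis`, which can then be ordered into a fixed-length feature vector for regression.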
| Category | Model | Number of parameters | Size (KB) |
|---|---|---|---|
| Linear models | Linear regression/efficient linear regression | 34 | ∼0.27 |
| Trees | Coarse tree/medium tree/fine tree | ∼175/375/495 | ∼1/∼2/∼2.7 |
| SVM | Linear SVM | 17 647 | ∼138 |
| SVM | Quadratic SVM | 9827 | ∼76.8 |
| SVM | Cubic SVM | 9011 | ∼70.4 |
| SVM | Gaussian SVM | 9759 | ∼76.2 |
| SVM | Kernel (exp.) SVM | 2048 | ∼16 |
| Kernel methods | Kernel least-squares | 2048 | ∼16 |
| Ensemble methods | Bagged trees | 13 380 | ∼105 |
| Ensemble methods | Boosted trees | 3150 | ∼25 |
| Neural networks | Narrow neural network (10) | 351 | ∼2.7 |
| Neural networks | Wide neural network (100) | 3501 | ∼27 |
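The neural-network rows of the table can be reproduced with simple arithmetic. The sketch below assumes a single fully connected hidden layer (10 or 100 units) fed by the 33 input features with one output, and 8 bytes per parameter (double precision); both assumptions are inferred from the tabulated numbers rather than stated in the text.

```python
def dense_param_count(n_inputs, hidden_units, n_outputs=1):
    """Weights plus biases of a network with one fully connected hidden layer."""
    hidden = n_inputs * hidden_units + hidden_units   # input-to-hidden weights + biases
    output = hidden_units * n_outputs + n_outputs     # hidden-to-output weights + biases
    return hidden + output

def size_kb(n_params, bytes_per_param=8):
    """Raw parameter storage in KB (1 KB = 1024 bytes); 8 bytes/param is an assumption."""
    return n_params * bytes_per_param / 1024

for name, units in [("narrow", 10), ("wide", 100)]:
    n_params = dense_param_count(33, units)
    print(name, n_params, round(size_kb(n_params), 1))
```

Under the same 8-byte assumption, the 34 parameters of the linear model (33 coefficients plus an intercept) occupy about 0.27 KB, in line with the first table row.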
The evaluated regression models exhibit distinct characteristics in modeling colorimetric pH behavior. Linear models assume proportional relationships between extracted colorimetric features and pH values, providing interpretability and computational efficiency but limited expressiveness for modeling nonlinear color transitions across the full pH range. Decision tree-based models partition the feature space through rule-based splits, enabling nonlinear mapping; however, coarse variants may underfit complex calibration behavior. Ensemble methods, including bagged and boosted trees, improve predictive stability through model aggregation, contributing to variance reduction and enhanced robustness to illumination-induced variability. Kernel-based SVM models address nonlinear feature-target relationships by projecting features into higher-dimensional spaces, with performance influenced by regularization parameters and kernel selection. Neural networks are capable of capturing higher-order feature interactions inherent in colorimetric responses, offering flexible nonlinear approximations well-suited for representing pH-dependent color transitions. Model selection was guided not only by predictive performance but also by computational efficiency and suitability for lightweight mobile deployment.
The dataset was partitioned into a training set (80%) and a test set (20%). A 10-fold cross-validation procedure was applied to the training subset during model development and selection, while the independent test subset was used for final performance evaluation. The labeled data used for supervised regression consisted of 33 colorimetric features as input variables and their corresponding pH values as continuous target outputs.31,32 Feature importance analysis was subsequently conducted on the selected regression model using the SHAP method. The resulting SHAP values were used to estimate the contribution of individual features to the model output and to rank colorimetric features according to their relative importance. This ranking was further used for feature reduction by selecting the most influential features and retraining models with reduced feature sets to determine whether similar predictive performance could be maintained with fewer input features.
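The partitioning scheme can be sketched with index lists alone. The example below is a minimal illustration and not the authors' tooling: the helper names, the fixed random seed, and the interleaved fold assignment are assumptions, and the sample count of 1035 is taken from the dataset size reported for this study in Table 2.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out a fraction as the independent test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold(indices, k=10):
    """Yield (train, validation) index lists for k roughly equal folds."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

train_idx, test_idx = train_test_split_indices(1035)
for fold_train, fold_val in k_fold(train_idx):
    pass  # fit a candidate model on fold_train and score it on fold_val
```

The held-out `test_idx` samples are never seen inside the cross-validation loop, which mirrors the separation between model selection and final evaluation described above.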
[Eqn (1)–(4): equation images not available]
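The performance metrics reported in this work (R2, MAE, and RMSE) have widely used standard definitions. The sketch below implements those textbook definitions for reference; it is an assumption that the numbered equations follow the same forms, and the function names are chosen for this example.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values for illustration only (not data from the study).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), r2_score(y_true, y_pred))
```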
The efficient linear regression and coarse tree models yielded the lowest R2 values and were therefore excluded from further consideration. SVMs and ensemble-based methods performed better. The wide neural network (WNN) achieved the highest predictive performance (R2 = 0.99) and was consequently selected for further analysis. The variability of the R2 values across cross-validation folds is represented by error bars in Fig. 4. The central bar corresponds to the mean R2 value for each regression model, while the upper and lower limits of the error bars represent the maximum and minimum R2 values obtained during 10-fold cross-validation. The WNN model showed the highest consistency, whereas the coarse tree model showed the largest variation. The relatively narrow error bars observed for the WNN model indicate stable predictive behavior across validation folds and suggest that the model does not overfit to individual folds.
A feature importance analysis was carried out to quantify the relative impact of each feature on the regression model's output. Moreover, this analysis can improve model interpretability and predictive performance.33 Therefore, the best-performing model was interpreted using SHAP values in Python. Among the 33 extracted colorimetric features, the model identified G-skewness, a-skewness, V, V-kurtosis, R-skewness, and entropy as the most influential features for the colorimetric determination of pH, as shown in Fig. 5a. Based on the plots in Fig. 5(bi–bvi), the skewness of the green channel exhibited a pronounced peak under highly basic conditions, reflecting enhanced asymmetry in pixel intensity distributions at high pH values. A secondary peak was also observed around pH 4, suggesting increased asymmetry in the acidic range. The skewness of the red channel showed a similar pattern, indicating that variations in skewness within these channels contribute to the model's capacity to differentiate between pH levels. In contrast, the skewness of the a channel gradually decreased with increasing pH, indicating that lower values of this feature are characteristic of more basic conditions and therefore contribute to higher predicted pH values, following an approximately linear trend. The V channel showed a local maximum around pH 3 and a pronounced global maximum near pH 10, after which it decreased toward very high pH values. This variation reflects changes in brightness in the colorimetric response and provides a discriminative feature for identifying alkaline conditions. The kurtosis of the V channel exhibited a distinct peak around pH 3 while remaining relatively stable across the rest of the pH range. Finally, entropy remained relatively stable across most pH levels, with a local minimum around pH 10 and a global maximum near pH 12, reflecting localized changes in pixel intensity distributions at strongly basic pH levels.
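To make the idea behind SHAP attributions concrete, the sketch below computes exact Shapley values by brute force for a hypothetical three-feature toy model. It does not use the SHAP library and is not the explainer applied in this study (the text only states that SHAP values were computed in Python); the toy model, the baseline, and the function names are all assumptions for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition are replaced by their baseline value.
    The cost grows as 2**n, so this only suits a handful of features.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_model(f):
    """Hypothetical model: one additive term and one interaction term."""
    return 2.0 * f[0] + f[1] * f[2]

phi = shapley_values(toy_model, x=[1.0, 2.0, 3.0], baseline=[0.0, 0.0, 0.0])
print(phi)
```

The attributions sum to the difference between the prediction and the baseline output, and the interaction term is shared equally between the two features that produce it. Averaging the absolute attributions over many samples gives the kind of global ranking summarized in Fig. 5a.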
Relative importance (RI) metrics based on SHAP values were calculated to quantify the contribution of each feature to the model predictions and to identify the most informative descriptors governing the colorimetric response. This analysis identified six features as the most influential contributors to the model predictions. To evaluate the impact of feature reduction, feature subset experiments were conducted in which different combinations and numbers of input features were tested against the full 33-feature model (Fig. 6a), which was defined as the baseline with an RI of 100%. Relative to this baseline, models trained on reduced feature subsets exhibited lower R2 values. The decline was most pronounced for the narrowest subsets: the individual V-channel descriptors (Fig. 6b) and the texture-only features (Fig. 6c) resulted in substantial performance degradation, corresponding to RI values of only 14.2% and 13.4%, respectively. Notably, the texture-only features provided a higher R2 and a lower MAE than the individual V-channel descriptors despite their lower RI. The model trained exclusively on skewness-based features (Fig. 6d) achieved an R2 of 0.948 with an RI of 37.7%. Models using optimized subsets of four (Fig. 6e) and six (Fig. 6f) influential features achieved R2 values of 0.920 and 0.969, corresponding to RI values of 29.9% and 39.6%, respectively, only slightly lower than the full feature configuration. Thus, although RI decreased when fewer features were used, the skewness-only, top four-feature, and top six-feature subsets retained most of the predictive information contained in the full 33-feature model.
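One plausible reading of the RI metric, consistent with the full feature set being assigned 100%, is the share of total mean absolute SHAP value carried by a feature subset. The sketch below implements that reading; the formula is an assumption (the text does not state it explicitly), and the importance values are hypothetical placeholders, not results from this study.

```python
def relative_importance(mean_abs_shap, subset):
    """Percentage of the total mean |SHAP| carried by the given feature subset."""
    total = sum(mean_abs_shap.values())
    return 100.0 * sum(mean_abs_shap[name] for name in subset) / total

# Hypothetical mean |SHAP| values, for illustration only.
mean_abs_shap = {"G-skewness": 4.0, "a-skewness": 3.0, "V": 2.0, "B-mean": 1.0}

print(relative_importance(mean_abs_shap, mean_abs_shap.keys()))        # full set
print(relative_importance(mean_abs_shap, ["G-skewness", "a-skewness"]))  # top two
```

Under this definition RI is additive over features, so a subset's RI indicates how much of the model's attribution mass it covers, which need not track R2 directly.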
These results demonstrate that a reduced set of physically meaningful colorimetric features can maintain predictive performance while enhancing computational efficiency, supporting the implementation of lightweight, robust smartphone-based sensing.
As a final validation step, the cross-device generalization of the developed application was evaluated using the independent dataset acquired from the second smartphone. Because these images were strictly excluded from the model development phase, they served as an external test for cross-device generalization. The deployed model maintained highly accurate predictions (R2 = 0.97, MAE = 0.48, RMSE = 0.59; the regression plot is presented in SI Fig. S2), confirming that pHScoper effectively tolerates variations in different smartphone camera sensors and internal image processing pipelines.
To evaluate the computational efficiency of the proposed lightweight deployment strategy, the original 33-feature model and the SHAP-optimized 6-feature model were quantitatively compared within the smartphone application. To ensure a fair comparison, both models were converted to TensorFlow Lite format and benchmarked under identical experimental conditions, including the same image resolution and preprocessing workflow. As summarized in SI Table S1, the SHAP-optimized model reduced the input features from 33 to 6 (an 81.8% reduction in dimensionality) and decreased the TensorFlow Lite model size from 16 KB to 6 KB (a 62.5% reduction). Despite the smaller model size, both exhibited comparable runtime memory usage (∼0.18 MB), indicating that the fixed TensorFlow Lite runtime overhead and image preprocessing operations dominate the overall memory footprint. Notably, evaluated over 1000 repeated runs on the Android platform, the optimized 6-feature model achieved an average inference time of 0.06 ± 0.02 ms, firmly supporting its applicability for real-time mobile deployment.
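The reported reductions follow directly from the before and after values, and the timing statistic is a mean and standard deviation over repeated calls. The sketch below checks the arithmetic and shows a generic timing loop; the helper names are hypothetical, and the loop runs in desktop Python rather than in the TensorFlow Lite runtime on Android, so it will not reproduce the reported 0.06 ± 0.02 ms.

```python
import statistics
import time

def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before

def benchmark(fn, runs=1000):
    """Mean and standard deviation of the wall-clock time of fn(), in milliseconds."""
    times_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(times_ms), statistics.stdev(times_ms)

print(round(percent_reduction(33, 6), 1))   # input features: 81.8
print(round(percent_reduction(16, 6), 1))   # model size in KB: 62.5
```

In practice `benchmark` would wrap a single inference call on a fixed input so that only the prediction step is timed, separately from image loading and preprocessing.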
The pH-dependent variations of the top six features (Fig. 5b) highlight a nonlinear relationship between the extracted image features and pH, resulting in feature-specific response patterns. This nonlinear calibration behavior underscores the limitations of conventional pH estimation methods, which may not be well-suited to capture the complex mappings between image-derived descriptors and chemical responses on the pH strip, thereby motivating the use of ML-based regression approaches.18 At very low and very high pH values, the slightly increased prediction variability is closely related to the nonlinear response behavior of the indicator dyes near the boundaries of their effective transition ranges. Because colorimetric pH indicator dyes operate through protonation–deprotonation equilibria, highly acidic or basic conditions cause the dyes to reach chemical saturation. These dominant color states exhibit reduced sensitivity to incremental pH changes, thereby limiting color discrimination near the pH extremes.41 This chemical limitation directly contributes to the comparatively higher variance observed near the pH boundaries in Fig. 6, despite the model's overall strong predictive performance. To enable continuous pH prediction, regression modeling combined with feature importance analysis was employed. Feature importance analysis further enhances interpretability by quantifying the contribution of each feature to the model output. 
In particular, SHAP-based analysis enables post-hoc interpretation of trained models, providing insight into the relative importance and directional influence of colorimetric features.42 The comparable predictive performance of the skewness-only, top four-feature, and top six-feature subsets relative to the full 33-feature model indicates that channel skewness in the RGB, HSV, and CIELAB color spaces, together with the V channel, its kurtosis, and entropy, plays a dominant role in capturing pH-dependent color transitions on the pH strip (Fig. 6d–f). These subsets achieved RI values of 37.7%, 29.9%, and 39.6%, respectively, demonstrating that reasonable colorimetric pH prediction performance can be maintained with reduced feature dimensionality. These findings highlight the importance of feature selection and indicate the potential for further model simplification while preserving predictive performance.
Although individual indicator pads appear visually uniform, the analyzed ROI in this study encompassed four distinct pads, each formulated to exhibit different color responses across the pH scale. Consequently, texture-based descriptors (e.g., contrast, correlation, entropy) are not merely capturing image noise or random local variations. Rather, they physically quantify the spatial color gradients and inter-pad transitions that dynamically change as each pad responds differently to a given pH level. While the SHAP analysis indicates that color distribution features are the dominant predictors (Fig. 6c), these texture metrics capture complementary spatial context regarding the multi-pad chemical reaction. Furthermore, capturing images under varying positions and illumination conditions introduced the necessary dataset variability to rigorously evaluate model robustness. This variability directly impacted algorithm performance across different data splits; for instance, the larger variation in R2 observed for the coarse tree model during cross-validation (Fig. 4) reflects its inherent sensitivity to fold-specific feature distributions. Because coarse decision trees rely on a limited number of rigid, threshold-based splits, minor variations in image-derived features across data folds can significantly alter the learned decision boundaries, leading to inconsistent predictive performance.43 In contrast, the WNN model exhibited greater stability and lower variance. This improved tolerance to heterogeneous acquisition conditions stems from the network's capacity to effectively model complex, nonlinear interactions among colorimetric features, ensuring robust performance even under varying illumination.44
The findings of this study are also consistent with recent research on AI-based colorimetric pH sensing using smartphone cameras or other portable sensing systems. More recently, paper-based diagnostic platforms have increasingly incorporated computational image analysis and advanced sensing strategies to improve assay sensitivity, robustness, and quantitative interpretation. For instance, Du et al. introduced a rapid deep learning-based quantitative lateral flow assay integrating residual neural networks and temporal modeling to enable reliable quantification within the early stages of assay development, substantially reducing interpretation time while maintaining high quantitative accuracy.45 Similarly, Park et al. developed a smartphone-assisted deep learning framework combined with a bioengineered enrichment strategy to improve the sensitivity of lateral flow assays and enhance the interpretation of weak colorimetric responses in noninvasive HIV screening.46 Lee et al. reported a deep learning-assisted point-of-care diagnostic framework that combines image-based region-of-interest detection with sequential analysis for rapid assay interpretation, demonstrating the potential of computational support for improving analytical performance in paper-based tests.47 Similarly, Han et al. introduced a deep learning-enhanced vertical flow paper-based assay incorporating nanoparticle amplification and computational analysis to achieve highly sensitive quantitative detection.48 Although these studies target different analytes and sensing formats, they demonstrate the growing transition of paper-based assays from subjective interpretation toward more robust quantitative systems. In this context, the proposed framework contributes to these ongoing efforts by integrating quantitative image analysis with ML-assisted pH prediction. 
While many of these existing studies rely on classification approaches or directly utilize raw RGB measurements, the present study specifically adopts a regression-based approach to enable more precise, continuous pH estimation. Unlike classification methods that merely assign discrete pH labels, our regression framework successfully captures finer intermediate variations in the colorimetric response. In addition, this study incorporates engineered colorimetric descriptors from multiple color spaces and evaluates their relative importance using SHAP-based feature analysis. Table 2 summarizes selected representative studies.
| Reference | Sensor | Method | Dataset | Features | Performance |
|---|---|---|---|---|---|
| 19 | Smartphone + pH strips | LS-SVM | N/A | Raw RGB values | Classification accuracy = 100% |
| 36 | Smartphone + pH strips | None (color adaptation) | N/A | CIE 1976 u′v′ color space | Error ≈ 0.25 pH units |
| 37 | Smartphone + pH strips | KNN | 2689 experimental samples | RGB color features | R2 ≈ 0.99 |
| 38 | Microneedle patch + web | CNN (VGG16) | 1466 images | RGB image data | ACC = 0.98 |
| 39 | AI-WMCS | 1D-CNN–GRU | Real and artificial tear samples | Multi-channel color patches | R2 ≈ 0.99 |
| 40 | Smartphone camera | Decision tree | 1025 data points (pH 4–8) | RGB, hue, saturation, luminance, grayscale, coloration | ACC = 0.67 |
| This study | Smartphone + pH strips | WNN | 1035 images | RGB, HSV, CIELAB features | R2 ≈ 0.99 |

a LS-SVM = Least-squares SVM; KNN = K-nearest neighbors; CNN = convolutional neural network; VGG = visual geometry group; AI-WMCS = AI-assisted wearable microfluidic colorimetric sensor; GRU = gated recurrent unit; DT = decision tree; WNN = wide neural network.
This study presents several advantages beyond predictive performance. First, it covers the complete pH range from 0 to 14, whereas prior studies have reported improved precision when estimation is restricted to narrower ranges. For instance, Kadian et al.38 achieved high precision within a pH range of 2–12, while Alhaqi et al.40 reported balanced F1-scores associated with relatively moderate precision values within the pH range of 4–8. Extending such approaches to broader pH ranges may require additional methodological adjustments. A second advantage lies in the composition and diversity of the dataset. While Elsenety et al.37 reported high prediction accuracy using a larger dataset acquired under controlled illumination, the present work intentionally incorporates challenging acquisition settings, such as camera angle and natural illumination, to better reflect real-world scenarios. Third, nonlinear colorimetric behavior is explicitly evaluated through comprehensive feature importance analysis. While ML has been widely applied in related studies,19,37,38,40 explicit feature importance analyses have not been systematically reported, with model behavior primarily addressed in an end-to-end manner. To build upon earlier work, our study integrates feature importance analysis to interpret nonlinear and feature-specific responses of individual colorimetric features across the continuous pH range, with model inference performed locally on the mobile device. This study reduces reliance on specialized hardware by using a standard smartphone camera and commercially available pH strips, thereby lowering system cost and complexity. In contrast, methods based on fixed imaging systems36 or custom hardware38,39 may improve measurement consistency but often rely on additional hardware configurations that limit portability and scalability. The present approach demonstrates reliable pH estimation under practical conditions, supporting field-deployable and portable applications.
Despite the high predictive performance and successful on-device integration, several limitations should be acknowledged. The colorimetric features were derived exclusively from MQuant® pH strips. Because indicator compositions and resulting colorimetric responses vary across commercial manufacturers, the generalizability of the proposed model to other strip brands remains an open question. To address this, future iterations will implement transfer learning strategies. Under this approach, the current architecture can serve as a pre-trained base model. By fine-tuning this model using smaller, brand-specific supplementary datasets acquired under similar conditions, the system could efficiently adapt to different indicators without the computational and practical burden of complete model redevelopment. Although the proposed framework demonstrates robustness to moderate camera-angle variations included in the training data, explicit perspective-normalization algorithms were not incorporated into the current preprocessing pipeline. Because the ROI extraction relies on controlled orientation, severe angular distortions or completely unrestricted strip placements could still negatively impact predictive performance, representing a limitation of the current study. To further enhance operational reliability in unconstrained real-world environments, future iterations of the mobile application will integrate algorithmic geometric correction and automated ROI detection—defining the reaction zone dynamically rather than relying on predefined geometric shapes. While the current study focuses on colorimetric pH estimation, the implemented framework is adaptable to other biochemical assays, such as ammonia, lactate, or uric acid. By adapting the ROI templates and retraining the model with analyte-specific datasets, the current pH system serves as a base model for cost-effective and extensible quantitative analysis. 
To achieve this, future datasets for new colorimetric sensors must similarly incorporate variability in light conditions and camera angles. Furthermore, since the reported results were validated on two smartphone platforms, expanding validation across a wider range of devices, camera systems, and alternative light sources (e.g., laser illumination) remains an important future objective. Finally, further optimization of the TensorFlow Lite integration will ensure this generalized framework remains efficient and robust for diverse paper-based analytical deployments.
Supplementary information (SI) is available. See DOI: https://doi.org/10.1039/d6ay00780e.
This journal is © The Royal Society of Chemistry 2026