An image-based food quality analysis framework driven by large language models
Abstract
Although computer vision approaches based on machine learning (ML) and deep learning (DL) have been applied to image-based food quality analysis, they typically require large labeled datasets, intensive training, and expert knowledge. Recent advances in large language models (LLMs) have brought enhanced reasoning capabilities and flexible multimodal perception. In this work, an LLM-driven vision inspection framework featuring two inference routes was developed. The text-only inference (TOI) route converts pre-extracted image features into structured textual prompts for LLM-based reasoning. The vision-language inference (VLI) route directly processes raw food images using LLMs with built-in visual encoders, enabling end-to-end multimodal inference. A lightweight graphical user interface (GUI) was integrated to automate feature extraction and structured data generation with one-click operation. Several state-of-the-art LLMs were evaluated on food ripeness, freshness, and authenticity datasets. With GPT-5, the TOI route attained an accuracy of 1.000 on low-complexity datasets and maintained strong performance (0.875) on more complex tasks, while the VLI route achieved consistently high accuracy (0.964–1.000) with substantially fewer training images, matching or surpassing traditional ML and DL baselines. The proposed framework relies solely on prompt-based in-context learning, eliminating the need for task-specific fine-tuning. These results demonstrate the feasibility and practicality of efficient, fine-tuning-free, LLM-driven vision inspection for food quality analysis.
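
To illustrate the two inference routes, the following Python sketch shows how TOI and VLI calls might be structured, assuming the OpenAI Python client; the prompt templates, the "gpt-5" model identifier string, the feature names, and the helper functions toi_infer and vli_infer are illustrative assumptions rather than the authors' implementation.

import base64
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment
MODEL = "gpt-5"    # model identifier string is an assumption

def toi_infer(features, labels, examples=()):
    # Text-only inference (TOI): pre-extracted image features are rendered
    # into a structured textual prompt; optional (features, label) pairs
    # supply prompt-based in-context learning, with no fine-tuning.
    shots = "\n".join(
        "Example features:\n"
        + "\n".join(f"- {n}: {v}" for n, v in ex.items())
        + f"\nLabel: {lab}"
        for ex, lab in examples
    )
    prompt = (
        "You are a food quality inspector.\n"
        + (shots + "\n" if shots else "")
        + "Classify the sample below as one of " + ", ".join(labels) + ".\n"
        + "\n".join(f"- {name}: {value}" for name, value in features.items())
        + "\nAnswer with the label only."
    )
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def vli_infer(image_path, labels):
    # Vision-language inference (VLI): the raw image is passed directly to a
    # multimodal LLM with a built-in visual encoder (end-to-end inference).
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Classify this food sample as one of "
                         + ", ".join(labels) + ". Answer with the label only."},
                {"type": "image_url",
                 "image_url": {"url": "data:image/jpeg;base64," + b64}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip()

For example, a hypothetical ripeness query could look like toi_infer({"hue_mean": 42.1, "firmness_index": 0.73}, ["unripe", "ripe", "overripe"]), where the feature names are placeholders for whatever the GUI's feature-extraction step produces.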
