Ostrakon-VL: Towards Domain-Expert MLLM for Food-Service and Retail Stores
Zhiyong Shen, Gongpeng Zhao, Jun Zhou, Li Yu, Guandong Kou, Jichen Li, Chuanlei Dong, Zuncheng Li, Kaimao Li, Bingkun Wei, Shicheng Hu, Wei Xia, Wenguo Duan

TL;DR
Ostrakon-VL is a domain-specific multimodal large language model (MLLM) tailored to food-service and retail environments. It is accompanied by a new benchmark and a data curation pipeline aimed at improving robustness and efficiency.
Contribution
The paper introduces Ostrakon-VL, a specialized MLLM for Food-Service and Retail Stores (FSRS), together with the ShopBench benchmark and the QUAD data curation pipeline, advancing robustness and parameter efficiency.
Findings
Achieved a state-of-the-art score of 60.1 on ShopBench
Outperformed larger models such as Qwen3-VL-235B by 0.7 points
Demonstrated improved parameter efficiency and robustness
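The two headline numbers imply a baseline score and a size gap that the findings leave implicit. The sketch below works through that arithmetic. It assumes the 0.7-point margin is measured on the same ShopBench score as the 60.1. It also assumes Ostrakon-VL keeps the nominal size of its Qwen3-VL-8B base. Parameter counts are read from the model names only, which ignores architectural differences such as total versus active parameters. The constant and function names are illustrative.

```python
# Worked arithmetic for the Findings list (illustrative sketch).
# Assumption: the 0.7-point margin is on the same ShopBench score as 60.1.
OSTRAKON_SCORE = 60.1    # reported Ostrakon-VL score on ShopBench
MARGIN = 0.7             # reported lead over Qwen3-VL-235B
OSTRAKON_PARAMS_B = 8    # assumption: same nominal size as the Qwen3-VL-8B base
BASELINE_PARAMS_B = 235  # nominal size taken from the name "Qwen3-VL-235B"


def implied_baseline_score(score: float, margin: float) -> float:
    """Score the larger model would have, given the reported margin."""
    return round(score - margin, 1)


def size_ratio(big_b: float, small_b: float) -> float:
    """How many times larger the baseline is, by nominal parameter count."""
    return big_b / small_b


if __name__ == "__main__":
    print(implied_baseline_score(OSTRAKON_SCORE, MARGIN))                 # 59.4
    print(round(size_ratio(BASELINE_PARAMS_B, OSTRAKON_PARAMS_B), 1))   # 29.4
```

Under these assumptions, the reported result amounts to roughly a 29x smaller model matching and slightly exceeding the larger one. That is the sense in which the findings claim improved parameter efficiency.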
Abstract
Multimodal Large Language Models (MLLMs) have recently achieved substantial progress in general-purpose perception and reasoning. Nevertheless, their deployment in Food-Service and Retail Stores (FSRS) scenarios encounters two major obstacles: (i) real-world FSRS data, collected from heterogeneous acquisition devices, are highly noisy and lack auditable, closed-loop data curation, which impedes the construction of high-quality, controllable, and reproducible training corpora; and (ii) existing evaluation protocols do not offer a unified, fine-grained and standardized benchmark spanning single-image, multi-image, and video inputs, making it challenging to objectively gauge model robustness. To address these challenges, we first develop Ostrakon-VL, an FSRS-oriented MLLM based on Qwen3-VL-8B. Second, we introduce ShopBench, the first public benchmark for FSRS. Third, we propose QUAD…
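The abstract's second obstacle is the lack of a single benchmark that covers single-image, multi-image, and video inputs. The sketch below shows one way such a benchmark could be unified. It stores every item in one shared schema and reports a per-input-type breakdown alongside an overall mean. Everything here is hypothetical: `BenchItem`, `per_type_accuracy`, the field names, exact-match scoring, and macro-averaging are illustrative assumptions, not ShopBench's actual format or metric.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Literal, Sequence, Tuple

# Hypothetical input-type labels; the abstract names these three modalities.
InputType = Literal["single_image", "multi_image", "video"]


@dataclass(frozen=True)
class BenchItem:
    """One item in a hypothetical shared schema (not ShopBench's real format)."""
    item_id: str
    input_type: InputType
    media: Tuple[str, ...]  # one image, several images, or one video file
    question: str
    answer: str


def per_type_accuracy(items: Sequence[BenchItem],
                      predictions: Dict[str, str]) -> Dict[str, float]:
    """Exact-match accuracy (%) per input type plus their unweighted mean.

    Exact match is an assumed metric chosen for illustration only.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item in items:
        total[item.input_type] += 1
        pred = predictions.get(item.item_id, "")  # a missing prediction counts as wrong
        if pred.strip().lower() == item.answer.strip().lower():
            correct[item.input_type] += 1
    if not total:
        return {}
    scores = {t: 100.0 * correct[t] / total[t] for t in total}
    scores["macro_avg"] = sum(scores.values()) / len(scores)
    return scores


if __name__ == "__main__":
    items = [
        BenchItem("q1", "single_image", ("shelf.jpg",), "How many price tags are visible?", "3"),
        BenchItem("q2", "multi_image", ("before.jpg", "after.jpg"), "Was the display restocked?", "yes"),
        BenchItem("q3", "video", ("counter.mp4",), "Did the clerk wear gloves?", "no"),
    ]
    preds = {"q1": "3", "q2": "No", "q3": "no"}
    # single_image and video score 100.0, multi_image scores 0.0
    print(per_type_accuracy(items, preds))
```

Reporting each input type separately keeps a strong single-image score from masking weak video performance. The summary does not say whether ShopBench aggregates its results this way.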
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
