DiningBench: A Hierarchical Multi-view Benchmark for Perception and Reasoning in the Dietary Domain

Song Jin; Juntian Zhang; Xun Zhang; Zeying Tian; Fei Jiang; Guojun Yin; Wei Lin; Yong Liu; Rui Yan

arXiv:2604.10425·cs.CV·April 14, 2026

TL;DR

DiningBench is a comprehensive, hierarchical benchmark for evaluating vision-language models in the food domain, emphasizing fine-grained classification, nutrition estimation, and visual question answering.

Contribution

The paper introduces a new multi-view, hierarchical dataset with verification-based nutritional data and provides an extensive evaluation that reveals current models' limitations in fine-grained and nutritional reasoning.

Findings

01 Current VLMs excel at general reasoning but struggle with fine-grained visual discrimination.

02 Multi-view inputs and Chain-of-Thought reasoning both affect model performance.

03 The analysis identifies five primary failure modes in food-centric VLMs.

Abstract

Recent advancements in Vision-Language Models (VLMs) have revolutionized general visual understanding. However, their application in the food domain remains constrained by benchmarks that rely on coarse-grained categories, single-view imagery, and inaccurate metadata. To bridge this gap, we introduce DiningBench, a hierarchical, multi-view benchmark designed to evaluate VLMs across three levels of cognitive complexity: Fine-Grained Classification, Nutrition Estimation, and Visual Question Answering. Unlike previous datasets, DiningBench comprises 3,021 distinct dishes with an average of 5.27 images per entry, incorporating fine-grained "hard" negatives from identical menus and rigorous, verification-based nutritional data. We conduct an extensive evaluation of 29 state-of-the-art open-source and proprietary models. Our experiments reveal that while current VLMs excel at general…
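
To make the dataset description above more concrete, here is a minimal sketch of how a multi-view dish entry and its same-menu "hard" negatives might be represented. The class `DishEntry`, its field names, and the helpers `hard_negatives` and `mean_views` are hypothetical illustrations, not the schema or tooling of the released dataset. Only the figures 3,021 and 5.27 come from the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class DishEntry:
    """Hypothetical record for one dish; field names are illustrative only."""
    dish_id: str
    menu_id: str
    name: str
    # Several views of the same dish (the "multi-view" aspect).
    image_paths: list[str] = field(default_factory=list)


def hard_negatives(target: DishEntry, catalog: list[DishEntry]) -> list[DishEntry]:
    """Return the other dishes that share the target's menu.

    Assumption: "hard" negatives are taken to be different dishes from the same menu.
    """
    return [
        d for d in catalog
        if d.menu_id == target.menu_id and d.dish_id != target.dish_id
    ]


def mean_views(catalog: list[DishEntry]) -> float:
    """Average number of images per dish in the catalog."""
    return sum(len(d.image_paths) for d in catalog) / len(catalog)


# Toy catalog (invented data) with two menus.
catalog = [
    DishEntry("d1", "menuA", "beef noodle soup", ["d1_a.jpg", "d1_b.jpg"]),
    DishEntry("d2", "menuA", "spicy beef noodle soup", ["d2_a.jpg"]),
    DishEntry("d3", "menuB", "fried rice", ["d3_a.jpg", "d3_b.jpg", "d3_c.jpg"]),
]

negatives_for_d1 = [d.dish_id for d in hard_negatives(catalog[0], catalog)]
toy_mean_views = mean_views(catalog)

# Rough image count implied by the abstract's figures (3,021 dishes, 5.27 images each).
# The average is rounded, so this is an estimate rather than a reported total.
approx_total_images = round(3021 * 5.27)
```

Under these assumptions, the abstract's figures imply on the order of 15,900 images in total; the exact count is not stated in the text above.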

Code & Models

Repositories

meituan/DiningBench
github

Datasets

meituan/DiningBench
dataset · 132 dl
