Food-500 Cap: A Fine-Grained Food Caption Benchmark for Evaluating Vision-Language Models

Zheng Ma; Mianzhi Pan; Wenhan Wu; Kanzhi Cheng; Jianbing Zhang; Shujian Huang; Jiajun Chen

arXiv:2308.03151 · cs.CV · August 8, 2023

TL;DR

This paper introduces Food-500 Cap, a detailed food caption dataset, to evaluate vision-language models' performance and biases in the food domain, revealing their limitations and regional biases.

Contribution

The creation of the Food-500 Cap dataset, which pairs fine-grained food attributes with geographic classification, for evaluating VLMs in a specialized domain.

Findings

1. VLMs underperform in the food domain compared to general tasks.

2. Significant regional bias exists in VLMs' food recognition abilities.

3. VLMs show limitations in handling fine-grained food attributes.

Abstract

Vision-language models (VLMs) have shown impressive performance in substantial downstream multi-modal tasks. However, only comparing the fine-tuned performance on downstream tasks leads to the poor interpretability of VLMs, which is adverse to their future improvement. Several prior works have identified this issue and used various probing methods under a zero-shot setting to detect VLMs' limitations, but they all examine VLMs using general datasets instead of specialized ones. In practical applications, VLMs are usually applied to specific scenarios, such as e-commerce and news fields, so the generalization of VLMs in specific domains should be given more attention. In this paper, we comprehensively investigate the capabilities of popular VLMs in a specific field, the food domain. To this end, we build a food caption dataset, Food-500 Cap, which contains 24,700 food images with 494…

Code & Models

Repositories

aaronma2020/Food500-Cap
Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques