FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture

Wenyan Li; Xinyu Zhang; Jiaang Li; Qiwei Peng; Raphael Tang; Li Zhou; Weijia Zhang; Guimin Hu; Yifei Yuan; Anders Søgaard; Daniel Hershcovich; Desmond Elliott

arXiv:2406.11030·cs.CL·October 1, 2024

TL;DR

FoodieQA is a comprehensive multimodal dataset that enables fine-grained understanding of Chinese food culture, highlighting the challenges faced by vision-language models in capturing cultural and regional food diversity.

Contribution

The paper introduces FoodieQA, a new dataset for multimodal food understanding, and evaluates the performance of vision-language models and large language models on this culturally rich dataset.

Findings

01. LLMs outperform VLMs on text-only questions

02. VLMs lag behind humans on multi-image questions

03. Closed-weights VLMs perform closer to human accuracy

Abstract

Food is a rich and varied dimension of cultural heritage, crucial to both individuals and social groups. To bridge the gap in the literature on the often-overlooked regional diversity in this domain, we introduce FoodieQA, a manually curated, fine-grained image-text dataset capturing the intricate features of food cultures across various regions in China. We evaluate vision-language models (VLMs) and large language models (LLMs) on newly collected, unseen food images and corresponding questions. FoodieQA comprises three multiple-choice question-answering tasks in which models must answer questions based on multiple images, a single image, and text-only descriptions, respectively. While LLMs excel at text-based question answering, surpassing human accuracy, open-source VLMs still fall short by 41% on multi-image and 21% on single-image VQA tasks, although closed-weights models…
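As a rough illustration of how results on three multiple-choice tasks like these could be tallied, the sketch below computes per-task accuracy and the gap to a reference score in percentage points. It is not the authors' evaluation code: the record keys (`task`, `answer`, `prediction`), the task labels, and the toy records are all assumptions made for this example, and the numbers are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return per-task accuracy for multiple-choice predictions.

    Each record is a dict with the hypothetical keys 'task', 'answer',
    and 'prediction' (these names are assumptions, not the dataset schema).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["task"]] += 1
        if record["prediction"] == record["answer"]:
            correct[record["task"]] += 1
    return {task: correct[task] / total[task] for task in total}


def gap_to_reference(model_acc, reference_acc):
    """Per-task gap (reference minus model) in percentage points."""
    return {
        task: round(100 * (reference_acc[task] - model_acc[task]), 1)
        for task in model_acc
        if task in reference_acc
    }


# Toy records for illustration only; not taken from FoodieQA.
records = [
    {"task": "multi_image", "answer": "A", "prediction": "A"},
    {"task": "multi_image", "answer": "C", "prediction": "B"},
    {"task": "single_image", "answer": "D", "prediction": "D"},
    {"task": "text_only", "answer": "B", "prediction": "B"},
]

model_acc = accuracy_by_task(records)
# Hypothetical reference accuracy, e.g. a human baseline on one task.
gaps = gap_to_reference(model_acc, {"multi_image": 0.75})
```

Keeping the three tasks separate, rather than pooling all questions, matches how the abstract reports different gaps for multi-image and single-image questions.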

Code & Models

Repositories

lyan62/FoodieQA
PyTorch · Official

Datasets

lyan62/FoodieQA
dataset · 49 dl

Videos

FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture · Underline

Taxonomy

Topics: Culinary Culture and Tourism