FoodSense: A Multisensory Food Dataset and Benchmark for Predicting Taste, Smell, Texture, and Sound from Images

Sabab Ishraq; Aarushi Aarushi; Juncai Jiang; Chen Chen

arXiv:2604.14388·cs.CV·April 20, 2026

TL;DR

FoodSense introduces a large multisensory food dataset and benchmark for predicting taste, smell, texture, and sound from images, bridging cognitive science and multimodal AI.

Contribution

The paper presents FoodSense, a dataset annotated for multisensory food perception, and FoodSense-VL, a benchmark model that predicts and explains sensory experiences from images.

Findings

01. The FoodSense dataset contains 66,842 participant-image pairs across 2,987 unique food images.

02. The FoodSense-VL model predicts multisensory ratings and provides grounded explanations.

03. Many standard metrics are inadequate for evaluating multisensory inference.

Abstract

Humans routinely infer taste, smell, texture, and even sound from food images, a phenomenon well studied in cognitive science. However, prior vision-language research on food has focused primarily on recognition tasks such as meal identification, ingredient detection, and nutrition estimation. Image-based prediction of multisensory experience remains largely unexplored. We introduce FoodSense, a human-annotated dataset for cross-sensory inference containing 66,842 participant-image pairs across 2,987 unique food images. Each pair includes numeric ratings (1-5) and free-text descriptors for four sensory dimensions: taste, smell, texture, and sound. To enable models to both predict and explain sensory expectations, we expand short human annotations into image-grounded reasoning traces. A large language model generates visual justifications conditioned on the image, ratings, and…
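As a rough illustration of the annotation structure the abstract describes, the sketch below models one participant-image pair and averages ratings per image. The class name `Annotation`, its field names, the helper `mean_ratings_per_image`, and the toy records are all assumptions made for this example; they are not the released dataset's schema or the authors' code. Only the four sensory dimensions, the 1-5 rating range, and the two dataset counts come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The four sensory dimensions named in the abstract.
SENSES = ("taste", "smell", "texture", "sound")


@dataclass
class Annotation:
    """One participant-image pair (hypothetical schema, field names assumed)."""
    image_id: str
    participant_id: str
    ratings: dict      # sense -> integer rating in 1..5
    descriptors: dict  # sense -> free-text descriptor

    def __post_init__(self):
        # Enforce the 1-5 rating scale stated in the abstract.
        for sense in SENSES:
            rating = self.ratings[sense]
            if not 1 <= rating <= 5:
                raise ValueError(f"{sense} rating {rating} is outside 1-5")


def mean_ratings_per_image(annotations):
    """Average each sensory rating over all participants who saw an image."""
    grouped = defaultdict(lambda: defaultdict(list))
    for ann in annotations:
        for sense in SENSES:
            grouped[ann.image_id][sense].append(ann.ratings[sense])
    return {
        image_id: {sense: mean(values) for sense, values in by_sense.items()}
        for image_id, by_sense in grouped.items()
    }


# Toy records invented for illustration; they are not taken from FoodSense.
toy = [
    Annotation("img_001", "p1",
               {"taste": 4, "smell": 3, "texture": 5, "sound": 2},
               {"taste": "sweet", "smell": "buttery",
                "texture": "crispy", "sound": "light crunch"}),
    Annotation("img_001", "p2",
               {"taste": 5, "smell": 4, "texture": 4, "sound": 2},
               {"taste": "sugary", "smell": "warm",
                "texture": "flaky", "sound": "quiet"}),
]
summary = mean_ratings_per_image(toy)

# Counts reported in the abstract imply roughly 22 annotations per image.
PAIRS_PER_IMAGE = 66_842 / 2_987
```

Averaging is only one way to collapse multiple participants' ratings into a per-image target; the abstract does not say how the authors aggregate them, so treat this as a convenience for exploration.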

Code & Models

Models

sababishraq/foodsense-vl
model · 1 download

Datasets

sababishraq/foodsense-dataset
dataset · 806 downloads
