FoodSense: A Multisensory Food Dataset and Benchmark for Predicting Taste, Smell, Texture, and Sound from Images
Sabab Ishraq, Aarushi Aarushi, Juncai Jiang, Chen Chen

TL;DR
FoodSense introduces a large multisensory food dataset and benchmark for predicting taste, smell, texture, and sound from images, bridging cognitive science and multimodal AI.
Contribution
The paper presents a novel dataset, FoodSense, with annotations for multisensory food perception and a benchmark model for predicting and explaining sensory experiences from images.
Findings
The FoodSense dataset contains 66,842 participant-image pairs across 2,987 unique food images.
The FoodSense-VL model predicts multisensory ratings and provides image-grounded explanations.
Many standard metrics are inadequate for evaluating multisensory inference.
Abstract
Humans routinely infer taste, smell, texture, and even sound from food images, a phenomenon well studied in cognitive science. However, prior vision-language research on food has focused primarily on recognition tasks such as meal identification, ingredient detection, and nutrition estimation. Image-based prediction of multisensory experience remains largely unexplored. We introduce FoodSense, a human-annotated dataset for cross-sensory inference containing 66,842 participant-image pairs across 2,987 unique food images. Each pair includes numeric ratings (1-5) and free-text descriptors for four sensory dimensions: taste, smell, texture, and sound. To enable models to both predict and explain sensory expectations, we expand short human annotations into image-grounded reasoning traces. A large language model generates visual justifications conditioned on the image, ratings, and…
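The annotation structure described in the abstract (one record per participant-image pair, with a 1-5 rating and a free-text descriptor for each of the four sensory dimensions) can be pictured with a minimal Python sketch. The class name, field names, and the per-image averaging helper are hypothetical illustrations, not the dataset's released schema or the authors' code.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The four sensory dimensions named in the abstract.
SENSES = ("taste", "smell", "texture", "sound")


@dataclass(frozen=True)
class SensoryAnnotation:
    """One participant-image pair (field names are assumptions)."""
    participant_id: str
    image_id: str
    ratings: dict      # sense -> integer rating on the 1-5 scale
    descriptors: dict  # sense -> free-text descriptor

    def __post_init__(self):
        # Reject records that miss a dimension or leave the 1-5 scale.
        for sense in SENSES:
            rating = self.ratings.get(sense)
            if not isinstance(rating, int) or not 1 <= rating <= 5:
                raise ValueError(f"{sense} rating must be an integer in 1-5")
            if sense not in self.descriptors:
                raise ValueError(f"missing descriptor for {sense}")


def mean_ratings_per_image(annotations):
    """Average each sense's rating over all participants who saw an image."""
    collected = defaultdict(lambda: defaultdict(list))
    for ann in annotations:
        for sense in SENSES:
            collected[ann.image_id][sense].append(ann.ratings[sense])
    return {
        image_id: {sense: mean(values) for sense, values in by_sense.items()}
        for image_id, by_sense in collected.items()
    }
```

Averaging over participants is only one possible way to turn several pairs for the same image into a single target; the abstract does not state how ratings are aggregated, if at all.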
