J-ORA: A Framework and Multimodal Dataset for Japanese Object Identification, Reference, Action Prediction in Robot Perception
Jesse Atuhurra, Hidetaka Kamigaito, Taro Watanabe, Koichiro Yoshino

TL;DR
J-ORA introduces a multimodal dataset with detailed object attribute annotations for Japanese human-robot dialogue. The annotations improve performance on object identification, reference resolution, and next-action prediction, and the evaluation exposes a performance gap between proprietary and open-source vision language models.
Contribution
The paper presents J-ORA, a new dataset with detailed object attributes tailored for Japanese robot perception, and evaluates the impact of those attributes on multimodal perception tasks using both proprietary and open-source Vision Language Models (VLMs).
Findings
Object attribute annotations improve perception performance.
Proprietary VLMs outperform open-source models.
Understanding of object affordances varies across models.
Abstract
We introduce J-ORA, a novel multimodal dataset that bridges the gap in robot perception by providing detailed object attribute annotations within Japanese human-robot dialogue scenarios. J-ORA is designed to support three critical perception tasks: object identification, reference resolution, and next-action prediction. It does so by leveraging a comprehensive template of attributes (e.g., category, color, shape, size, material, and spatial relations). Extensive evaluations with both proprietary and open-source Vision Language Models (VLMs) reveal that incorporating detailed object attributes substantially improves multimodal perception performance compared to evaluation without them. Despite the improvement, we find that a gap remains between proprietary and open-source VLMs. In addition, our analysis of object affordances demonstrates varying abilities in understanding object…
