J-ORA: A Framework and Multimodal Dataset for Japanese Object Identification, Reference, Action Prediction in Robot Perception

Jesse Atuhurra; Hidetaka Kamigaito; Taro Watanabe; Koichiro Yoshino

arXiv:2510.21761·cs.RO·October 28, 2025

TL;DR

J-ORA is a multimodal dataset with detailed object attribute annotations for Japanese human-robot dialogue. Supplying these attributes to vision language models improves object identification, reference resolution, and next-action prediction, and the evaluation shows a remaining performance gap between proprietary and open-source models.

Contribution

The paper presents J-ORA, a new dataset with detailed object attribute annotations tailored for Japanese robot perception, and evaluates the effect of those annotations on three multimodal perception tasks using both proprietary and open-source Vision Language Models (VLMs).

Findings

01

Object attribute annotations improve perception performance.

02

Proprietary VLMs outperform open-source models.

03

The ability to understand object affordances varies across models.

Abstract

We introduce J-ORA, a novel multimodal dataset that bridges the gap in robot perception by providing detailed object attribute annotations within Japanese human-robot dialogue scenarios. J-ORA is designed to support three critical perception tasks (object identification, reference resolution, and next-action prediction) by leveraging a comprehensive template of attributes (e.g., category, color, shape, size, material, and spatial relations). Extensive evaluations with both proprietary and open-source Vision Language Models (VLMs) reveal that incorporating detailed object attributes substantially improves multimodal perception performance compared with omitting them. Despite this improvement, a gap remains between proprietary and open-source VLMs. In addition, our analysis of object affordances demonstrates varying abilities in understanding object…
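
To make the attribute template concrete, the following is a minimal sketch of how one annotated object might be represented and how a text prompt could be built with or without its attributes. Everything here is an assumption for illustration: the `ObjectAttributes` class, its field names (taken from the attribute list in the abstract), the `build_prompt` helper, and the prompt wording are hypothetical and are not the dataset's actual schema or the authors' evaluation code.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


# Hypothetical record type. Field names mirror the attribute list in the
# abstract; the real J-ORA annotation schema may differ.
@dataclass
class ObjectAttributes:
    category: str
    color: Optional[str] = None
    shape: Optional[str] = None
    size: Optional[str] = None
    material: Optional[str] = None
    spatial_relations: List[str] = field(default_factory=list)


def build_prompt(utterance: str,
                 objects: List[ObjectAttributes],
                 use_attributes: bool) -> str:
    """Compose a text prompt, optionally listing each object's attributes."""
    lines = [f"Utterance: {utterance}"]
    if use_attributes:
        for i, obj in enumerate(objects, start=1):
            # Keep only the attributes that are actually filled in.
            filled = {k: v for k, v in asdict(obj).items() if v}
            desc = ", ".join(f"{k}={v}" for k, v in filled.items())
            lines.append(f"Object {i}: {desc}")
    lines.append("Task: identify the object the speaker refers to.")
    return "\n".join(lines)


# Illustrative object and utterance (invented, not taken from the dataset).
cup = ObjectAttributes(
    category="cup",
    color="red",
    material="ceramic",
    spatial_relations=["left of the plate"],
)
with_attrs = build_prompt("赤いコップを取ってください", [cup], use_attributes=True)
without_attrs = build_prompt("赤いコップを取ってください", [cup], use_attributes=False)
print(with_attrs)
print(without_attrs)
```

Comparing a model's answers on the two prompt variants is one simple way to approximate the with-attributes versus without-attributes comparison that the abstract describes.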
