MM-Conv: A Multimodal Dataset and Benchmark for Context-Aware Grounding in 3D Dialogue
Anna Deichler, Jim O'Regan, Fethiye Irmak Dogan, Lubos Marcinek, Anna Klezovich, Iolanda Leite, and Jonas Beskow

TL;DR
This paper introduces MM-Conv, a new benchmark and dataset for context-aware grounding in 3D dialogue, emphasizing the importance of explicit ambiguity resolution in dynamic environments.
Contribution
The paper presents a novel multimodal dataset collected from VR interactions and a two-stage grounding pipeline that improves accuracy by explicitly resolving conversational ambiguity before visual localization.
Findings
Contextual rewriting improves grounding accuracy by 11-22 percentage points on average.
After rewriting, the GroundingDINO detector reaches 56.7% on pronominal references.
Decoupling linguistic reasoning from visual perception enhances performance.
Abstract
Grounding language in the physical world requires AI systems to interpret references that emerge dynamically during conversation. While current vision-language models (VLMs) excel at static image tasks, they struggle to resolve ambiguous expressions in spontaneous, multi-turn dialogue. We address this gap by introducing (1) a benchmark for referential communication in dynamic 3D environments, built from 6.7 hours of egocentric VR interaction with synchronized speech, motion, gaze, and 3D scene geometry, and (2) a two-stage grounding pipeline that explicitly resolves conversational ambiguity before visual localization. The benchmark includes over 4,200 manually verified referring expressions spanning full, partitive, and pronominal types. Our contextual rewriting approach improves grounding performance by 11-22 percentage points on average, with a pure detector (GroundingDINO) reaching…
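The two-stage idea described in the abstract (resolve the ambiguous reference in language first, then hand a self-contained query to a visual detector) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Turn` structure, the `mentioned_object` annotation, the pronoun list, the most-recent-mention heuristic, and the dictionary-backed `toy_detector` are hypothetical stand-ins, not the authors' rewriting model and not the GroundingDINO API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in normalized image coordinates.
Box = Tuple[float, float, float, float]

# Toy pronoun list; a real system would need a far more careful notion of "ambiguous".
PRONOUNS = {"it", "them", "this", "that", "these", "those"}


@dataclass
class Turn:
    speaker: str
    text: str
    # Hypothetical annotation: the full noun phrase mentioned in this turn, if any.
    mentioned_object: Optional[str] = None


def rewrite_reference(expression: str, history: List[Turn]) -> str:
    """Stage 1 (toy stand-in): replace a pronominal expression with the most
    recent full noun phrase found in the dialogue history."""
    tokens = [tok.strip(".,!?") for tok in expression.lower().split()]
    if not any(tok in PRONOUNS for tok in tokens):
        return expression  # already self-contained, leave unchanged
    for turn in reversed(history):
        if turn.mentioned_object:
            return turn.mentioned_object
    return expression  # nothing to resolve against


def ground(
    expression: str,
    history: List[Turn],
    detector: Callable[[str], Optional[Box]],
) -> Tuple[str, Optional[Box]]:
    """Stage 2: pass the rewritten query to any text-conditioned detector."""
    query = rewrite_reference(expression, history)
    return query, detector(query)


def toy_detector(query: str) -> Optional[Box]:
    """Stand-in for an open-vocabulary detector: looks the query up in a fixed scene."""
    scene = {"the red lamp on the desk": (0.1, 0.2, 0.3, 0.6)}
    return scene.get(query)


history = [Turn("A", "I like the red lamp on the desk.", "the red lamp on the desk")]
query, box = ground("Can you move it?", history, toy_detector)
print(query, box)
```

In this sketch the detector never sees the pronoun: by the time the query reaches stage 2 it is a full description, which mirrors the decoupling of linguistic reasoning from visual perception that the summary highlights.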
