Using Machine Mental Imagery for Representing Common Ground in Situated Dialogue

Biswesh Mohapatra; Giovanni Duca; Laurent Romary; Justine Cassell

arXiv:2604.21144 · cs.CL · April 24, 2026

TL;DR

This paper explores how multimodal, visual scaffolding can enhance shared context representation in situated dialogue, reducing semantic flattening and improving grounded response generation.

Contribution

It introduces an active visual scaffolding framework that converts dialogue state into persistent visual representations, improving context tracking in conversational agents.

Findings

01 Visual scaffolding reduces representational blur in dialogue.

02 Hybrid multimodal representations outperform purely textual or purely visual approaches.

03 Incremental externalization improves dialogue reasoning over full-dialogue approaches.

Abstract

Situated dialogue requires speakers to maintain a reliable representation of shared context rather than reasoning only over isolated utterances. Current conversational agents often struggle with this requirement, especially when the common ground must be preserved beyond the immediate context window. In such settings, fine-grained distinctions are frequently compressed into purely textual representations, leading to a critical failure mode we call "representational blur", in which similar but distinct entities collapse into interchangeable descriptions. This semantic flattening creates an illusion of grounding, where agents appear locally coherent but fail to track shared context persistently over time. Inspired by the role of mental imagery in human reasoning, and based on the increased availability of multimodal models, we explore whether conversational agents can be given an…
