Listening with the Eyes: Benchmarking Egocentric Co-Speech Grounding across Space and Time

Weijie Zhou; Xuantang Xiong; Zhenlin Hu; Xiaomeng Zhu; Chaoyang Zhao; Honghui Dong; Zhengyou Zhang; Ming Tang; Jinqiao Wang

arXiv:2603.07966 · cs.CV · March 10, 2026

TL;DR

This paper introduces EcoG-Bench, a challenging benchmark for evaluating egocentric co-speech grounding that requires models to jointly predict what, where, and when, revealing significant gaps in current multimodal understanding.

Contribution

The paper presents EcoG-Bench, a new bilingual benchmark with dense annotations for evaluating egocentric speech-gesture grounding, and demonstrates the limitations of state-of-the-art models in this task.

Findings

01. Humans achieve near-ceiling performance (96.9%) on EcoG-Bench.

02. Current models perform poorly, with the best at 17.0%.

03. Improving audio-visual interface timing significantly boosts model performance.

Abstract

In situated collaboration, speakers often use intentionally underspecified deictic commands (e.g., "pass me that"), whose referent becomes identifiable only by aligning speech with a brief co-speech pointing stroke. However, many embodied benchmarks admit language-only shortcuts, allowing MLLMs to perform well without learning the audio-visual alignment required by deictic interaction. To bridge this gap, we introduce Egocentric Co-Speech Grounding (EcoG), where grounding is executable only if an agent jointly predicts What, Where, and When. To operationalize this, we present EcoG-Bench, an evaluation-only bilingual (EN/ZH) diagnostic benchmark of 811 egocentric clips with dense spatial annotations and millisecond-level stroke supervision. It is organized under a Progressive Cognitive Evaluation…
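As an illustration of what jointly predicting What, Where, and When could mean in practice, the sketch below scores a prediction as correct only if all three parts agree with the annotation. It is not the benchmark's official metric, which this page does not describe. The names `GroundTruth`, `Prediction`, and `jointly_correct`, the box-containment test for Where, and the inside-the-stroke-interval test for When are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class GroundTruth:
    """Hypothetical annotation for one clip (field names are assumptions)."""
    label: str                               # What: name of the referent
    box: Tuple[float, float, float, float]   # Where: (x_min, y_min, x_max, y_max)
    stroke_ms: Tuple[int, int]               # When: (start, end) of the stroke, in ms


@dataclass
class Prediction:
    """Hypothetical model output for one clip."""
    label: str
    point: Tuple[float, float]               # predicted (x, y) location
    time_ms: int                             # predicted stroke moment, in ms


def jointly_correct(pred: Prediction, gt: GroundTruth) -> bool:
    """Return True only if What, Where, and When are all correct."""
    # What: case-insensitive label match (an assumed matching rule).
    what_ok = pred.label.strip().lower() == gt.label.strip().lower()

    # Where: the predicted point must fall inside the annotated box.
    x, y = pred.point
    x_min, y_min, x_max, y_max = gt.box
    where_ok = x_min <= x <= x_max and y_min <= y <= y_max

    # When: the predicted time must fall inside the annotated stroke interval.
    start, end = gt.stroke_ms
    when_ok = start <= pred.time_ms <= end

    return what_ok and where_ok and when_ok


gt = GroundTruth(label="cup", box=(10, 10, 50, 50), stroke_ms=(1200, 1500))
print(jointly_correct(Prediction("cup", (30, 30), 1300), gt))  # all three match
print(jointly_correct(Prediction("cup", (30, 30), 2000), gt))  # timing is off
```

Under a strict conjunction like this, a model that names the right object from the language alone still fails whenever its location or timing is wrong, which is one way to rule out the language-only shortcuts the abstract mentions.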

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Speech Recognition and Synthesis