Listening with the Eyes: Benchmarking Egocentric Co-Speech Grounding across Space and Time
Weijie Zhou, Xuantang Xiong, Zhenlin Hu, Xiaomeng Zhu, Chaoyang Zhao, Honghui Dong, Zhengyou Zhang, Ming Tang, Jinqiao Wang

TL;DR
This paper introduces EcoG-Bench, a challenging benchmark for evaluating egocentric co-speech grounding that requires models to jointly predict what, where, and when, revealing significant gaps in current multimodal understanding.
Contribution
The paper presents EcoG-Bench, a new bilingual benchmark with dense annotations for evaluating egocentric speech-gesture grounding, and demonstrates the limitations of state-of-the-art models in this task.
Findings
Humans achieve near-ceiling performance (96.9%) on EcoG-Bench.
Current models perform poorly: the best reaches only 17.0%.
Improving audio-visual interface timing significantly boosts model performance.
Abstract
In situated collaboration, speakers often use intentionally underspecified deictic commands (e.g., "pass me that"), whose referent becomes identifiable only by aligning speech with a brief co-speech pointing stroke. However, many embodied benchmarks admit language-only shortcuts, allowing MLLMs to perform well without learning the audio-visual alignment required by deictic interaction. To bridge this gap, we introduce Egocentric Co-Speech Grounding (EcoG), where grounding is executable only if an agent jointly predicts What, Where, and When. To operationalize this, we present EcoG-Bench, an evaluation-only bilingual (EN/ZH) diagnostic benchmark of 811 egocentric clips with dense spatial annotations and millisecond-level stroke supervision. It is organized under a Progressive Cognitive Evaluation…
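The joint What/Where/When requirement can be illustrated with a minimal sketch. It is not the benchmark's actual scoring code: the `Prediction` and `GroundTruth` types, the point-in-box test for Where, the stroke-interval test for When, and the `tol_ms` tolerance are all assumptions chosen for illustration. It shows only the general idea that a sample counts as correct when all three components are correct at once.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Prediction:
    """Hypothetical model output for one clip."""
    what: str                    # predicted referent label
    where: Tuple[float, float]   # predicted (x, y) point in pixels
    when_ms: float               # predicted stroke time in milliseconds


@dataclass
class GroundTruth:
    """Hypothetical annotation for one clip."""
    what: str
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
    stroke_ms: Tuple[float, float]          # (start, end) of the pointing stroke


def joint_correct(pred: Prediction, gt: GroundTruth, tol_ms: float = 0.0) -> bool:
    """Return True only if What, Where, and When are all correct."""
    # What: case-insensitive label match (an assumed matching rule).
    what_ok = pred.what.strip().lower() == gt.what.strip().lower()

    # Where: the predicted point must fall inside the annotated box.
    x, y = pred.where
    x0, y0, x1, y1 = gt.box
    where_ok = x0 <= x <= x1 and y0 <= y <= y1

    # When: the predicted time must fall inside the stroke interval,
    # widened by an optional tolerance.
    start, end = gt.stroke_ms
    when_ok = start - tol_ms <= pred.when_ms <= end + tol_ms

    return what_ok and where_ok and when_ok


def joint_accuracy(pairs: Iterable[Tuple[Prediction, GroundTruth]]) -> float:
    """Fraction of samples on which all three components are correct."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(joint_correct(p, g) for p, g in pairs) / len(pairs)
```

Under a strict conjunction like this, a model that names the right object but points at the wrong time receives no credit, which is one way a language-only shortcut would fail to score.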
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Speech Recognition and Synthesis
