Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents

Dayong Liu; Chao Xu; Weihong Chen; Suyu Zhang; Juncheng Wang; Jiankang Deng; Baigui Sun; Yang Liu

arXiv:2511.18685 · cs.CV · March 13, 2026

TL;DR

This paper introduces CFG-Bench, a comprehensive benchmark for evaluating fine-grained action understanding in embodied agents, revealing current models' limitations and the benefits of supervised fine-tuning.

Contribution

We present CFG-Bench, a new benchmark of 1,368 curated videos paired with 19,562 question-answer pairs that systematically assesses fine-grained action reasoning in embodied agents, and we show that supervised fine-tuning improves performance.

Findings

01. Leading MLLMs struggle with detailed physical instructions.

02. Models have limitations in higher-order reasoning about intention.

03. Supervised fine-tuning significantly improves embodied benchmark performance.

Abstract

Multimodal Large Language Models (MLLMs) show promising results as decision-making engines for embodied agents operating in complex, physical environments. However, existing benchmarks often prioritize high-level planning or spatial reasoning, leaving the fine-grained action intelligence required for embodied physical interaction underexplored. To address this gap, we introduce CFG-Bench, a new benchmark designed to systematically evaluate this crucial capability. CFG-Bench consists of 1,368 curated videos paired with 19,562 question-answer pairs spanning three evaluation paradigms targeting four cognitive abilities: 1) Physical Interaction, 2) Temporal-Causal Relation, 3) Intentional Understanding, and 4) Evaluative Judgment. Together, these dimensions provide a systematic framework for assessing a model's ability to translate visual observations into actionable knowledge, moving…
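
The composition described in the abstract can be made concrete with a small sketch. The four ability names below come from the abstract. Everything else is an assumption made for illustration: the `QARecord` layout, the `accuracy_by_ability` helper, the exact-match scoring rule, and the toy questions are hypothetical and are not the benchmark's actual schema or metric. The three evaluation paradigms are not named in this summary, so they are left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Ability(Enum):
    # The four cognitive abilities named in the abstract.
    PHYSICAL_INTERACTION = "Physical Interaction"
    TEMPORAL_CAUSAL_RELATION = "Temporal-Causal Relation"
    INTENTIONAL_UNDERSTANDING = "Intentional Understanding"
    EVALUATIVE_JUDGMENT = "Evaluative Judgment"


@dataclass(frozen=True)
class QARecord:
    # Hypothetical record layout, not the benchmark's real schema.
    video_id: str
    ability: Ability
    question: str
    answer: str


def accuracy_by_ability(records, predictions):
    """Case-insensitive exact-match accuracy per ability (an assumed metric).

    `predictions` must be aligned index-by-index with `records`.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record, predicted in zip(records, predictions):
        total[record.ability] += 1
        if predicted.strip().lower() == record.answer.strip().lower():
            correct[record.ability] += 1
    return {ability: correct[ability] / total[ability] for ability in total}


# Toy data, invented purely for illustration.
records = [
    QARecord("v1", Ability.PHYSICAL_INTERACTION, "Which hand grips the lid?", "left"),
    QARecord("v1", Ability.PHYSICAL_INTERACTION, "Is the lid twisted or pulled?", "twisted"),
    QARecord("v2", Ability.INTENTIONAL_UNDERSTANDING, "Why is the cup tilted?", "to pour"),
]
predictions = ["Left", "pulled", "to pour"]
scores = accuracy_by_ability(records, predictions)

# Average number of QA pairs per video, from the abstract's figures.
QA_PER_VIDEO = 19_562 / 1_368
```

Reporting a separate score per ability, rather than one pooled number, is what lets a breakdown like the one in the findings distinguish weak physical-detail understanding from weak intention reasoning. The abstract's figures work out to roughly 14.3 question-answer pairs per video.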

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Action Observation and Synchronization