Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents
Dayong Liu, Chao Xu, Weihong Chen, Suyu Zhang, Juncheng Wang, Jiankang Deng, Baigui Sun, Yang Liu

TL;DR
This paper introduces CFG-Bench, a comprehensive benchmark for evaluating fine-grained action understanding in embodied agents, revealing current models' limitations and the benefits of supervised fine-tuning.
Contribution
We present CFG-Bench, a new benchmark of 1,368 curated videos and 19,562 question-answer pairs that systematically assesses fine-grained action reasoning in embodied agents, and show that supervised fine-tuning improves performance.
Findings
Leading MLLMs struggle with detailed physical instructions.
Models have limitations in higher-order reasoning about intention.
Supervised fine-tuning significantly improves performance on embodied benchmarks.
Abstract
Multimodal Large Language Models (MLLMs) show promising results as decision-making engines for embodied agents operating in complex, physical environments. However, existing benchmarks often prioritize high-level planning or spatial reasoning, leaving the fine-grained action intelligence required for embodied physical interaction underexplored. To address this gap, we introduce CFG-Bench, a new benchmark designed to systematically evaluate this crucial capability. CFG-Bench consists of 1,368 curated videos paired with 19,562 question-answer pairs spanning three evaluation paradigms targeting four cognitive abilities: 1) Physical Interaction, 2) Temporal-Causal Relation, 3) Intentional Understanding, and 4) Evaluative Judgment. Together, these dimensions provide a systematic framework for assessing a model's ability to translate visual observations into actionable knowledge, moving…
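The structure the abstract describes (question-answer pairs attached to videos, each targeting one of four cognitive abilities) can be illustrated with a minimal sketch of per-ability scoring. This is not the authors' evaluation code: the `QARecord` fields, the `accuracy_by_ability` helper, and the toy records are all hypothetical, and the three evaluation paradigms are left out because the abstract does not name them. Only the ability names and the dataset counts come from the text above.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four cognitive abilities named in the abstract.
ABILITIES = (
    "Physical Interaction",
    "Temporal-Causal Relation",
    "Intentional Understanding",
    "Evaluative Judgment",
)


@dataclass
class QARecord:
    # Hypothetical schema: the abstract does not specify field names.
    video_id: str
    ability: str
    correct: bool  # whether the model's answer was judged correct


def accuracy_by_ability(records):
    """Return {ability: accuracy} for every ability that has at least one record."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        if r.ability not in ABILITIES:
            raise ValueError(f"unknown ability: {r.ability!r}")
        totals[r.ability] += 1
        hits[r.ability] += int(r.correct)
    return {a: hits[a] / totals[a] for a in ABILITIES if totals[a]}


# Toy records for illustration only; these are not results from the paper.
records = [
    QARecord("v001", "Physical Interaction", True),
    QARecord("v001", "Physical Interaction", False),
    QARecord("v002", "Temporal-Causal Relation", True),
]
scores = accuracy_by_ability(records)

# Average question density implied by the counts in the abstract.
qa_per_video = 19562 / 1368
```

Reporting one score per ability, rather than a single pooled accuracy, keeps a weakness in one dimension (for example, intention reasoning) from being masked by strength in another.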
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Action Observation and Synchronization
