OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks

Zixuan Wang; Dingming Li; Hongxing Li; Shuo Chen; Yuchen Yan; Wenqi Zhang; Yongliang Shen; Weiming Lu; Jun Xiao; Yueting Zhuang

arXiv:2508.05614 · cs.CL · August 8, 2025

TL;DR

OmniEAR is a benchmark that evaluates how well large language models reason about physical interactions, tool use, and multi-agent coordination in embodied tasks. Models reach 85-96% success with explicit instructions, but performance degrades severely when they must reason from constraints.

Contribution

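The paper introduces OmniEAR, a benchmark for assessing embodied reasoning in language models across physical interaction, tool use, and multi-agent coordination. Its evaluation highlights the limitations of current models and the architectural challenges behind them.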

Findings

01 Models achieve 85-96% success when given explicit instructions.

02 Performance drops significantly on tool reasoning and implicit collaboration.

03 Providing complete environmental information can impair coordination performance.

Abstract

Large language models excel at abstract reasoning but their capacity for embodied agent reasoning remains largely unexplored. We present OmniEAR, a comprehensive framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. Unlike existing benchmarks that provide predefined tool sets or explicit collaboration directives, OmniEAR requires agents to dynamically acquire capabilities and autonomously determine coordination strategies based on task demands. Through text-based environment representation, we model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints: while achieving 85-96% success with explicit instructions,…

Code & Models

Datasets

wangzx1210/OmniEAR
dataset · 63 downloads
