Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning
Qi Sun, Pengfei Hong, Tej Deep Pala, Vernon Toh, U-Xuan Tan, Deepanway Ghosal, Soujanya Poria

TL;DR
Emma-X is an embodied multimodal action model that improves robotic spatial reasoning and grounded task planning by combining a hierarchical embodiment dataset with trajectory segmentation, outperforming existing methods in real-world robotic tasks.
Contribution
The paper introduces Emma-X, a new embodied multimodal action model with grounded chain of thought and look-ahead spatial reasoning, utilizing a large hierarchical dataset and a trajectory segmentation strategy.
Findings
Emma-X outperforms baseline models in real-world robotic spatial reasoning tasks.
The hierarchical dataset improves grounding accuracy and task understanding.
Trajectory segmentation reduces hallucinations in subtask reasoning.
Abstract
Traditional reinforcement learning-based robotic control methods are often task-specific and fail to generalize across diverse environments or unseen objects and instructions. Visual Language Models (VLMs) demonstrate strong scene understanding and planning capabilities but lack the ability to generate actionable policies tailored to specific robotic embodiments. To address this, Visual-Language-Action (VLA) models have emerged, yet they face challenges in long-horizon spatial reasoning and grounded task planning. In this work, we propose the Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning, Emma-X. Emma-X leverages our constructed hierarchical embodiment dataset based on BridgeV2, containing 60,000 robot manipulation trajectories auto-annotated with grounded task reasoning and spatial guidance. Additionally, we introduce a trajectory…
Taxonomy
Topics: Semantic Web and Ontologies · Geographic Information Systems Studies
