Unified Embodied VLM Reasoning with Robotic Action via Autoregressive Discretized Pre-training

Yi Liu; Sukai Wang; Dafeng Wei; Xiaowei Cai; Linqing Zhong; Jiange Yang; Guanghui Ren; Jinyu Zhang; Maoqing Yao; Chuankang Li; Xindong He; Liliang Chen; Jianlan Luo

arXiv:2512.24125 · cs.RO · January 5, 2026

TL;DR

This paper introduces ERIQ, a benchmark for embodied reasoning in robotics, and FACT, a discrete action tokenizer, to improve the integration of reasoning and precise control in robotic manipulation.

Contribution

The paper presents ERIQ, a benchmark that evaluates embodied reasoning separately from execution, and proposes FACT, a discrete action tokenizer that links that reasoning to precise control in robotic manipulation.

Findings

01. ERIQ reveals a strong positive correlation between embodied reasoning capability and generalization.

02. FACT improves trajectory fidelity in discrete control sequences.

03. GenieReasoner outperforms prior methods in real-world robotic tasks.
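
To make "trajectory fidelity in discrete control sequences" concrete, the sketch below shows why discretizing continuous actions loses precision at all. It uses plain uniform binning, which is a generic baseline and not the FACT tokenizer (this page does not describe how FACT works). The function names, the action range [-1, 1], and the bin counts are all assumptions chosen for illustration.

```python
from typing import List


def tokenize(actions: List[float], low: float, high: float, n_bins: int) -> List[int]:
    """Map each continuous action value to the index of a uniform bin.

    Generic uniform binning, used here only as an illustrative baseline.
    """
    width = (high - low) / n_bins
    tokens = []
    for a in actions:
        clipped = min(max(a, low), high)      # keep the value inside [low, high]
        idx = int((clipped - low) / width)
        tokens.append(min(idx, n_bins - 1))   # a value equal to `high` goes in the last bin
    return tokens


def detokenize(tokens: List[int], low: float, high: float, n_bins: int) -> List[float]:
    """Map each token back to the centre of its bin."""
    width = (high - low) / n_bins
    return [low + (t + 0.5) * width for t in tokens]


def max_abs_error(a: List[float], b: List[float]) -> float:
    """Largest per-step deviation between two trajectories of equal length."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical one-dimensional trajectory, normalised to [-1, 1].
trajectory = [-1.0, -0.37, 0.0, 0.42, 1.0]

for n_bins in (8, 256):
    tokens = tokenize(trajectory, -1.0, 1.0, n_bins)
    recon = detokenize(tokens, -1.0, 1.0, n_bins)
    print(n_bins, max_abs_error(trajectory, recon))
```

With uniform bins, the round-trip error per step is bounded by half a bin width, so fidelity can only be bought with a larger vocabulary. A learned tokenizer aims for a better trade-off than this baseline.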

Abstract

General-purpose robotic systems operating in open-world environments must achieve both broad generalization and high-precision action execution, a combination that remains challenging for existing Vision-Language-Action (VLA) models. While large Vision-Language Models (VLMs) improve semantic generalization, insufficient embodied reasoning leads to brittle behavior, and conversely, strong reasoning alone is inadequate without precise control. To provide a decoupled and quantitative assessment of this bottleneck, we introduce Embodied Reasoning Intelligence Quotient (ERIQ), a large-scale embodied reasoning benchmark in robotic manipulation, comprising 6K+ question-answer pairs across four reasoning dimensions. By decoupling reasoning from execution, ERIQ enables systematic evaluation and reveals a strong positive correlation between embodied reasoning capability and end-to-end VLA…

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI