E0: Enhancing Generalization and Fine-Grained Control in VLA Models via Tweedie Discrete Diffusion

Zhihao Zhan; Jiaying Zhou; Likui Zhang; Qinhan Lv; Hao Liu; Jusheng Zhang; Weizheng Li; Ziliang Chen; Tianshui Chen; Ruifeng Zhai; Keze Wang; Liang Lin; Guangrun Wang

arXiv:2511.21542 · cs.RO · March 26, 2026

TL;DR

E0 introduces a discrete diffusion approach for VLA models that improves generalization, fine-grained control, and robustness across diverse robotic tasks and environments.

Contribution

The paper proposes E0, a novel Tweedie discrete diffusion framework that aligns with token-based reasoning and enhances action generation in VLA models.

Findings

1. Achieves state-of-the-art performance on 14 environments.
2. Outperforms baselines by 10.7% on average.
3. Enhances robustness to camera shifts with augmentation.

Abstract

Vision-Language-Action (VLA) models offer a unified framework for robotic manipulation by integrating visual perception, language understanding, and control generation. However, existing VLA systems still struggle to generalize across diverse tasks, scenes, and camera viewpoints, and often produce coarse or unstable actions. We argue that these limitations are closely tied to the structural properties of actions in VLA settings, including the inherent multi-peaked nature of action distributions, the token-based symbolic reasoning of pretrained VLM/VLA backbones, and the effective finite resolution imposed by real-world robotic control. Motivated by these properties, we introduce E0, a Tweedie discrete diffusion framework that formulates action generation as iterative denoising over quantized action tokens. By operating in a discrete action space with a principled diffusion process, E0…
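To make "quantized action tokens" and "finite resolution" concrete, the following is a minimal sketch of uniform action binning. It is not taken from the paper: the bin count (`NUM_BINS = 256`), the normalized range `[-1, 1]`, and the function names `quantize` and `dequantize` are assumptions chosen for illustration, and the abstract does not state which tokenizer E0 actually uses.

```python
# Illustrative only: uniform binning of continuous actions into discrete tokens.
# NUM_BINS, LOW, and HIGH are assumed values, not settings reported for E0.
NUM_BINS = 256
LOW, HIGH = -1.0, 1.0


def quantize(action, num_bins=NUM_BINS, low=LOW, high=HIGH):
    """Map each continuous action value to the index of its uniform bin."""
    width = (high - low) / num_bins
    tokens = []
    for a in action:
        a = min(max(a, low), high)             # clip into the valid range
        idx = int((a - low) / width)           # bin index by floor division
        tokens.append(min(idx, num_bins - 1))  # a == high falls in the last bin
    return tokens


def dequantize(tokens, num_bins=NUM_BINS, low=LOW, high=HIGH):
    """Map each token back to the center of its bin."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]


if __name__ == "__main__":
    action = [-1.0, 0.0, 0.5, 1.0]
    tokens = quantize(action)
    recovered = dequantize(tokens)
    print(tokens)
    print(recovered)
```

Under these assumptions, the round-trip error of any in-range value is at most half a bin width, `(HIGH - LOW) / (2 * NUM_BINS)`. That bound is the sense in which a discrete action space has a finite resolution: a denoising process over such tokens can be no more precise than the bins it operates on.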

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis