TL;DR
AnchorVLA is a diffusion-based VLA policy for mobile manipulation that generates multimodal actions at low inference cost, improving success and stability in dynamic tasks.
Contribution
It introduces an anchored diffusion approach with a self-correction mechanism to enable reactive, multimodal control in mobile manipulation.
Findings
Improves success rates in diverse mobile manipulation tasks.
Reduces inference time compared to full diffusion models.
Enhances stability under disturbances and distribution shifts.
Abstract
A central challenge in mobile manipulation is preserving multiple plausible action modes while remaining reactive during execution. A bottle in a cluttered scene can often be approached and grasped in multiple valid ways. Robust behavior depends on preserving this action diversity while remaining reactive as the scene evolves. Diffusion policies are appealing because they model multimodal action distributions rather than collapsing to one solution. But in practice, full iterative denoising is costly at control time. Action chunking helps amortize inference, yet it also creates partially open-loop behavior, allowing small mismatches to accumulate into drift. We present AnchorVLA, a diffusion-based VLA policy for mobile manipulation built on the core insight that when sampling begins near a plausible solution manifold, extensive denoising is unnecessary to recover multimodal, valid…
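The core insight in the abstract (sampling that starts near a plausible solution needs few denoising steps) can be illustrated with a toy sketch. This is not the paper's implementation: the two 1-D action modes, the hand-written `toy_denoise_step` (a stand-in for a learned denoiser), the noise scales, and the step count are all assumptions chosen for illustration.

```python
import random

# Hypothetical 1-D action space with two valid modes,
# e.g. "grasp from the left" and "grasp from the right".
MODES = (-1.0, 1.0)


def nearest_mode(x):
    """Return the mode closest to x."""
    return min(MODES, key=lambda m: abs(x - m))


def toy_denoise_step(x, step_size=0.5):
    """Toy stand-in for one learned denoising step:
    move x part of the way toward its nearest mode."""
    return x + step_size * (nearest_mode(x) - x)


def refine(x, num_steps):
    """Apply a fixed, small number of denoising steps."""
    for _ in range(num_steps):
        x = toy_denoise_step(x)
    return x


def worst_mode_error(samples):
    """Largest distance from any sample to its nearest mode."""
    return max(abs(x - nearest_mode(x)) for x in samples)


rng = random.Random(0)
n = 1000

# Anchored start: pick one anchor (mode) per sample, add small noise.
anchored_init = [rng.choice(MODES) + 0.1 * rng.gauss(0, 1) for _ in range(n)]
# Unanchored start: broad Gaussian noise, as in full diffusion sampling.
noise_init = [3.0 * rng.gauss(0, 1) for _ in range(n)]

# Same small step budget for both starting points.
anchored_out = [refine(x, num_steps=2) for x in anchored_init]
noise_out = [refine(x, num_steps=2) for x in noise_init]

print("anchored worst error:", worst_mode_error(anchored_out))
print("noise    worst error:", worst_mode_error(noise_out))
print("modes reached from anchors:", sorted({nearest_mode(x) for x in anchored_out}))
```

With the same two-step budget, the anchored samples end close to a valid mode and still cover both modes, while samples started from broad noise remain far from either. In this toy setting that is the trade the abstract describes: diversity comes from the choice of anchor, and the short refinement only has to correct a small residual.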
