TL;DR
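IntentVLA is a history-conditioned vision-language-action (VLA) framework that encodes recent visual observations into a compact short-horizon intent representation and conditions action-chunk generation on it, with the goal of stabilizing execution in robot manipulation tasks prone to observation aliasing.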
Contribution
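The paper introduces IntentVLA, a history-conditioned VLA policy, and AliasBench, a 12-task ambiguity-aware benchmark built on RoboTwin2 with matched training data and evaluation environments. Together they target the inter-chunk conflict that arises when frame-conditioned policies replan under partial observability.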
Findings
IntentVLA improves rollout stability across multiple benchmarks.
IntentVLA outperforms existing VLA baselines.
AliasBench isolates short-horizon observation aliasing effectively.
Abstract
Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench,…
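The inter-chunk conflict described in the abstract can be illustrated with a toy 1-D rollout. This is a minimal sketch under stated assumptions, not the paper's architecture: the names `frame_conditioned_chunk`, `history_conditioned_chunk`, `rollout`, and `CHUNK_LEN` are hypothetical, the "intent" is just a direction in {-1, +1}, and the "intent representation" is reduced to the sign of recent motion.

```python
import random

CHUNK_LEN = 4  # actions per chunk (hypothetical value for this toy)


def frame_conditioned_chunk(obs, rng):
    """Toy frame-conditioned policy: the observation is aliased, so both
    intents look equally plausible and one is resampled at every replan."""
    intent = rng.choice([-1, 1])
    return [intent] * CHUNK_LEN


def history_conditioned_chunk(history, rng):
    """Toy history-conditioned policy: infer the intent from net recent
    motion, and sample only when the history carries no motion yet."""
    if len(history) >= 2 and history[-1] != history[0]:
        intent = 1 if history[-1] > history[0] else -1
    else:
        intent = rng.choice([-1, 1])
    return [intent] * CHUNK_LEN


def rollout(policy_kind, n_chunks=20, window=3, seed=0):
    """Run n_chunks replanning steps; return (final position, intent flips)."""
    rng = random.Random(seed)
    pos, history, flips, prev_intent = 0, [0], 0, None
    for _ in range(n_chunks):
        if policy_kind == "frame":
            chunk = frame_conditioned_chunk(pos, rng)
        else:
            chunk = history_conditioned_chunk(history[-window:], rng)
        # Count a conflict whenever adjacent chunks disagree on direction.
        if prev_intent is not None and chunk[0] != prev_intent:
            flips += 1
        prev_intent = chunk[0]
        for action in chunk:
            pos += action
            history.append(pos)
    return pos, flips
```

Calling `rollout("frame")` and `rollout("history")` with the same seed shows the contrast: the history-conditioned toy commits to whichever intent it sampled first and never flips, while the frame-conditioned toy may reverse direction at any replanning step and so makes less net progress.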
