How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning
Bosung Kim, Ruiyi Wang, David Acuna, Jaehun Jung, Alexander Trevithick, Brandon Cui, Yejin Choi, and Prithviraj Ammanabrolu

TL;DR
This paper introduces DeMiAn (Dense Multi-aspect Annotation), a method that re-labels existing robot demonstration data with dense, VLM-generated language annotations, improving policy learning without additional data collection.
Contribution
DeMiAn leverages VLM-generated dense annotations across multiple aspects to significantly improve robot policy learning and out-of-distribution generalization.
Findings
DeMiAn improves success rates by 5 points over baselines.
It enhances performance on composite and out-of-distribution tasks.
Dense re-annotation shifts the compute-performance frontier.
Abstract
Scaling robot policy learning is bottlenecked by the cost of collecting demonstrations, while language annotations for existing demonstrations are comparatively cheap. We study language density as a lever for extracting more signal from a fixed robot or egocentric-video corpus. We introduce DeMiAn (Dense Multi-aspect Annotation), a two-stage approach that first re-labels demonstration segments with VLM-generated annotations along four complementary aspects: physical motion, scene composition, arm pose, and reasoning. A learned instructor then maps a task description and initial scene snapshot to a task-appropriate annotation at deployment, running asynchronously so generation latency is hidden behind policy execution. Across over 1M robot manipulation clips and 50K EgoVerse human-egocentric videos, DeMiAn improves both a vision-language-action policy and a video-based world-action model…
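The asynchronous instructor described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `DenseAnnotation`, `toy_instructor`, and `run_episode`, the timing values, and the choice to fall back to the plain task description until the annotation arrives are all assumptions made for illustration. Only the four aspect names and the idea of overlapping generation with policy execution come from the abstract.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# The four aspects named in the abstract.
ASPECTS = ("physical motion", "scene composition", "arm pose", "reasoning")


@dataclass(frozen=True)
class DenseAnnotation:
    """Hypothetical container for one dense annotation."""
    aspect: str
    text: str


def toy_instructor(task: str, snapshot: str, latency: float) -> DenseAnnotation:
    """Stand-in for the learned instructor (assumption: a real one would
    call a model). The sleep mimics generation latency."""
    time.sleep(latency)
    return DenseAnnotation(
        aspect=ASPECTS[0],
        text=f"[{ASPECTS[0]}] detailed guidance for '{task}' given {snapshot}",
    )


def run_episode(task, snapshot, num_steps=4, step_time=0.05, latency=0.02):
    """Run a toy control loop while the instructor generates in the background.

    Returns the instruction that conditioned the policy at each step.
    """
    used = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start annotation generation without blocking the control loop.
        pending = pool.submit(toy_instructor, task, snapshot, latency)
        instruction = task  # assumed fallback: the sparse task description
        for _ in range(num_steps):
            if pending.done():
                # Switch to the dense annotation once it is available.
                instruction = pending.result().text
            used.append(instruction)
            time.sleep(step_time)  # stands in for executing one policy step
    return used


if __name__ == "__main__":
    for i, instr in enumerate(run_episode("pick up the cup", "initial_frame")):
        print(i, instr)
```

In this sketch the control loop never waits on the instructor: early steps run on the sparse task description, and later steps pick up the dense annotation once the background call finishes. How the actual system conditions the policy before the annotation is ready is not stated in the abstract.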
