How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning

Bosung Kim; Ruiyi Wang; David Acuna; Jaehun Jung; Alexander Trevithick; Brandon Cui; Yejin Choi; and Prithviraj Ammanabrolu

arXiv:2605.17077 · cs.RO · May 19, 2026

TL;DR

This paper introduces DeMiAn, a dense multi-aspect annotation method that enhances robot policy learning by re-labeling demonstration data with rich language annotations, improving performance without additional data collection.

Contribution

DeMiAn leverages VLM-generated dense annotations across multiple aspects to significantly improve robot policy learning and out-of-distribution generalization.

Findings

01. DeMiAn improves success rates by 5 points over baselines.

02. It enhances performance on composite and out-of-distribution tasks.

03. Dense re-annotation shifts the compute-performance frontier.

Abstract

Scaling robot policy learning is bottlenecked by the cost of collecting demonstrations, while language annotations for existing demonstrations are comparatively cheap. We study language density as a lever for extracting more signal from a fixed robot or egocentric-video corpus. We introduce DeMiAn (Dense Multi-aspect Annotation), a two-stage approach that first re-labels demonstration segments with VLM-generated annotations along four complementary aspects: physical motion, scene composition, arm pose, and reasoning. A learned instructor then maps a task description and initial scene snapshot to a task-appropriate annotation at deployment, running asynchronously so generation latency is hidden behind policy execution. Across over 1M robot manipulation clips and 50K EgoVerse human-egocentric videos, DeMiAn improves both a vision-language-action policy and a video-based world-action model…
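
The asynchronous instructor described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_instructor`, `run_episode`, the `DenseAnnotation` type, the timing values, and the fallback to the raw task description while generation is pending are all assumptions made for illustration. Only the four aspect names come from the abstract.

```python
import threading
import time
from dataclasses import dataclass
from typing import List, Optional

# The four annotation aspects named in the abstract.
ASPECTS = ("physical motion", "scene composition", "arm pose", "reasoning")


@dataclass
class DenseAnnotation:
    aspect: str
    text: str


def toy_instructor(task: str, snapshot: str, delay_s: float = 0.05) -> DenseAnnotation:
    """Hypothetical stand-in for the learned instructor (not the paper's model)."""
    time.sleep(delay_s)  # simulate annotation-generation latency
    return DenseAnnotation(
        aspect=ASPECTS[0],
        text=f"{task}: move gripper toward target in {snapshot}",
    )


def run_episode(task: str, snapshot: str, steps: int = 10, step_s: float = 0.02) -> List[str]:
    """Run a toy control loop while the instructor generates in the background."""
    latest: dict = {"annotation": None}
    lock = threading.Lock()

    def worker() -> None:
        annotation = toy_instructor(task, snapshot)
        with lock:
            latest["annotation"] = annotation

    thread = threading.Thread(target=worker)
    thread.start()

    conditioning_log: List[str] = []
    for _ in range(steps):
        with lock:
            annotation: Optional[DenseAnnotation] = latest["annotation"]
        # Assumption: condition on the raw task until the dense annotation arrives.
        conditioning_log.append(annotation.text if annotation is not None else task)
        time.sleep(step_s)  # stand-in for one policy step
    thread.join()
    return conditioning_log
```

Calling `run_episode("pick up the cup", "kitchen_scene")` returns the text the policy would be conditioned on at each step. In this toy setup the control loop never waits for the instructor: early steps use the plain task description, and later steps pick up the dense annotation once the background thread has produced it.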
