TL;DR
DISC introduces a hypernetwork-based approach to decouple language instructions from visual observations in manipulation policies, reducing observation leakage and improving task generalization.
Contribution
The paper proposes a hypernetwork architecture that generates task-specific policies from instructions, so the generated policy itself never processes language directly, enhancing robustness and interpretability.
Findings
Outperforms entangled baselines on the LIBERO-90 and Meta-World benchmarks.
Surpasses large-scale pretrained models despite using no external pretraining data.
Enables few-shot adaptation and robust generalization across paraphrased instructions.
Abstract
Language-conditioned manipulation policies typically process instructions and observations through shared network parameters. This task-state entanglement provides a pathway for observation leakage: networks learn scene-to-action shortcuts that bypass language grounding entirely. DISC eliminates this failure structurally. Rather than conditioning a universal policy on language, DISC uses a hypernetwork to generate the entire parameter set of a task-specific visuomotor policy from the instruction alone. The generated policy never directly accesses language; its task-awareness must therefore come entirely from parameters that were generated from the instruction. Consequently, observation leakage has no pathway to emerge. Generating coherent high-dimensional policy weights, however, is itself a challenging problem. We address it with a two-stage hypernetwork whose refinement stage embeds the structure of gradient-based…
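The separation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the DISC architecture: the single linear hypernetwork `H`, the two-layer MLP policy, all dimensions, and the random vectors standing in for instruction embeddings are assumptions made for illustration, and the two-stage refinement mentioned in the abstract is not modeled. The sketch only shows the structural point that the instruction is used to produce the policy's weights, while the policy's forward pass receives the observation alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
EMB_DIM, OBS_DIM, HID_DIM, ACT_DIM = 16, 8, 32, 4

# Parameter shapes of the generated policy (assumed here to be a two-layer MLP).
POLICY_SHAPES = {
    "W1": (OBS_DIM, HID_DIM),
    "b1": (HID_DIM,),
    "W2": (HID_DIM, ACT_DIM),
    "b2": (ACT_DIM,),
}
N_POLICY_PARAMS = sum(int(np.prod(s)) for s in POLICY_SHAPES.values())

# Stand-in hypernetwork: one linear map from an instruction embedding
# to a flat vector holding every policy parameter.
H = rng.normal(scale=0.05, size=(EMB_DIM, N_POLICY_PARAMS))


def generate_policy(instruction_emb):
    """Map an instruction embedding to a full set of policy parameters."""
    flat = instruction_emb @ H
    params, offset = {}, 0
    for name, shape in POLICY_SHAPES.items():
        size = int(np.prod(shape))
        params[name] = flat[offset:offset + size].reshape(shape)
        offset += size
    return params


def policy_act(params, obs):
    """Forward pass of the generated policy; it takes no language input."""
    hidden = np.tanh(obs @ params["W1"] + params["b1"])
    return hidden @ params["W2"] + params["b2"]


# Two placeholder instruction embeddings and one shared observation.
emb_a = rng.normal(size=EMB_DIM)
emb_b = rng.normal(size=EMB_DIM)
obs = rng.normal(size=OBS_DIM)

action_a = policy_act(generate_policy(emb_a), obs)
action_b = policy_act(generate_policy(emb_b), obs)
```

In this sketch, `action_a` and `action_b` differ for the same observation only because the two instructions produced different weights; `policy_act` has no argument through which language could enter.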
