TL;DR
This paper introduces MINT, a novel imitation learning approach that disentangles intent from execution using multi-scale spectral tokenization, enabling better transfer, generalization, and efficiency in manipulation tasks.
Contribution
MINT employs multi-scale frequency-space tokenization to explicitly separate intent from execution, improving transferability and generalization in imitation learning.
Findings
Achieves state-of-the-art success rates on manipulation benchmarks.
Enables effective one-shot skill transfer by injecting intent tokens.
Demonstrates robustness and efficiency in real robot experiments.
Abstract
While imitation learning (IL) has achieved impressive success in dexterous manipulation through generative modeling and pretraining, state-of-the-art approaches like Vision-Language-Action (VLA) models still struggle with adaptation to environmental changes and skill transfer. We argue this stems from mimicking raw trajectories without understanding the underlying intent. To address this, we propose explicitly disentangling behavior intent from execution details in end-to-end IL: Mimic Intent, Not just Trajectories (MINT). We achieve this via multi-scale frequency-space tokenization, which enforces a spectral decomposition of the action chunk representation. We learn action tokens with a multi-scale coarse-to-fine structure, forcing the coarsest token to capture low-frequency global structure and the finer tokens to encode high-frequency details. This yields an abstract Intent token that…
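The coarse-to-fine spectral split described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' tokenizer: MINT learns its tokens, whereas the code below only partitions a fixed transform into frequency bands. The choice of a DCT, the function names `dct_matrix` and `split_into_bands`, the `band_edges` parameter, and the toy action chunk are all assumptions made for illustration.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis; rows are cosine modes ordered from low to high frequency."""
    k = np.arange(n)[:, None]   # frequency index
    t = np.arange(n)[None, :]   # time index
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    basis[0] /= np.sqrt(2.0)    # rescale the constant mode so the basis is orthonormal
    return basis

def split_into_bands(chunk, band_edges):
    """Split an action chunk of shape (T, D) into time-domain components, one per band.

    band_edges lists the exclusive upper frequency index of each band (hypothetical
    parameter); the last edge should equal T so that the bands sum back to the chunk.
    """
    basis = dct_matrix(chunk.shape[0])
    coeffs = basis @ chunk              # spectral coefficients, shape (T, D)
    bands, lo = [], 0
    for hi in band_edges:
        masked = np.zeros_like(coeffs)
        masked[lo:hi] = coeffs[lo:hi]   # keep only this band's frequencies
        bands.append(basis.T @ masked)  # inverse transform back to the time domain
        lo = hi
    return bands

# Toy 16-step, 2-DoF action chunk: a slow reach plus small high-frequency jitter.
rng = np.random.default_rng(0)
steps = np.linspace(0.0, 1.0, 16)[:, None]
chunk = np.hstack([steps, steps ** 2]) + 0.01 * rng.standard_normal((16, 2))

# Three scales: coarsest (global shape), middle, finest (details).
bands = split_into_bands(chunk, band_edges=[2, 6, 16])
coarse = bands[0]
```

Here `coarse` plays the role of the low-frequency global structure, and the remaining bands carry progressively finer detail. Because the basis is orthonormal and the bands cover all frequencies, the components sum back to the original chunk.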
