VICtoR: Learning Hierarchical Vision-Instruction Correlation Rewards for Long-horizon Manipulation
Kuo-Han Hung, Pang-Chi Lo, Jia-Fong Yeh, Han-Yuan Hsu, Yi-Ting Chen,, Winston H. Hsu

TL;DR
VICtoR is a hierarchical reward model that improves long-horizon manipulation by accurately assessing task progress through stages and motion evaluation, trained on videos and instructions.
Contribution
It introduces VICtoR, a hierarchical reward model with stage detection and motion evaluation for better long-horizon task learning.
Findings
Outperforms existing VIC methods by 43% in success rates.
Effective in both simulated and real-world environments.
Provides precise, multi-level reward signals for complex tasks.
Abstract
We study reward models for long-horizon manipulation tasks by learning from action-free videos and language instructions, which we term the visual-instruction correlation (VIC) problem. Recent advancements in cross-modality modeling have highlighted the potential of reward modeling through visual and language correlations. However, existing VIC methods face challenges in learning rewards for long-horizon tasks due to their lack of sub-stage awareness, difficulty in modeling task complexities, and inadequate object state estimation. To address these challenges, we introduce VICtoR, a novel hierarchical VIC reward model capable of providing effective reward signals for long-horizon manipulation tasks. VICtoR precisely assesses task progress at various levels through a novel stage detector and motion progress evaluator, offering insightful guidance for agents learning the task effectively.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsAdvanced Vision and Imaging · Image Processing Techniques and Applications · Domain Adaptation and Few-Shot Learning
