Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents
Zhizhen Zhang, Lei Zhu, Zhen Fang, Zi Huang, Yadan Luo

TL;DR
This paper introduces AcTOL, a vision-language pretraining method that learns temporally ordered and continuous representations from human action videos without relying on goal-reaching heuristics, with the aim of improving embodied agent performance.
Contribution
The paper proposes AcTOL (Action Temporal Coherence Learning), which learns ordered and continuous vision-language representations by contrasting semantic differences between frames and imposing a local Brownian bridge constraint, rather than relying on goal-reaching heuristics.
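To make the notion of "ordered" representations concrete, here is a minimal diagnostic sketch. It is not the authors' training objective: the function name `ordering_violation_rate` and the use of squared Euclidean distance are assumptions made purely for illustration. It checks whether, for each anchor frame, frames that are closer in time are also closer in embedding space.

```python
from itertools import combinations
from typing import Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ordering_violation_rate(frames: Sequence[Vector]) -> float:
    """Fraction of (anchor, near, far) triplets whose embedding distances
    disagree with temporal distance. Hypothetical diagnostic, not AcTOL's loss."""
    n = len(frames)
    violations = 0
    comparisons = 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for j, k in combinations(others, 2):
            gap_j, gap_k = abs(i - j), abs(i - k)
            if gap_j == gap_k:
                continue  # equal temporal gaps impose no ordering
            near, far = (j, k) if gap_j < gap_k else (k, j)
            comparisons += 1
            # A violation: the temporally nearer frame is not closer in embedding space.
            if sq_dist(frames[i], frames[near]) >= sq_dist(frames[i], frames[far]):
                violations += 1
    return violations / comparisons if comparisons else 0.0
```

Embeddings that move steadily along a line give a rate of zero, while a shuffled sequence gives a positive rate. A contrastive objective in this spirit would push that rate down during pretraining.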
Findings
The pretrained features improve performance on downstream manipulation tasks.
The method is robust to diverse linguistic instructions.
The learned representations enhance generalization in embodied agents.
Abstract
Pre-training vision-language representations on human action videos has emerged as a promising approach to reduce reliance on large-scale expert demonstrations for training embodied agents. However, prior methods often employ time contrastive learning based on goal-reaching heuristics, progressively aligning language instructions from the initial to the final frame. This overemphasis on future frames can result in erroneous vision-language associations, as actions may terminate early or include irrelevant moments at the end. To address this issue, we propose Action Temporal Coherence Learning (AcTOL) to learn ordered and continuous vision-language representations without rigid goal-based constraints. AcTOL treats a video as a continuous trajectory where it (1) contrasts semantic differences between frames to reflect their natural ordering, and (2) imposes a local Brownian bridge…
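The abstract is cut off before it describes the Brownian bridge constraint, so the following only illustrates the general idea of a Brownian bridge and should not be read as the paper's loss. A standard Brownian bridge pinned at two endpoints has a mean that interpolates linearly between them and a variance of (t - t_start)(t_end - t)/(t_end - t_start), which vanishes at both ends. The sketch below scores how far the interior frames of a short window stray from that mean. The helper names and the choice of a variance-scaled squared error are assumptions.

```python
from typing import List, Sequence

Vector = Sequence[float]


def bridge_mean(start: Vector, end: Vector, t: float,
                t_start: float, t_end: float) -> List[float]:
    """Mean of a Brownian bridge pinned at `start` (t_start) and `end` (t_end)."""
    alpha = (t - t_start) / (t_end - t_start)
    return [(1.0 - alpha) * s + alpha * e for s, e in zip(start, end)]


def bridge_variance(t: float, t_start: float, t_end: float) -> float:
    """Per-dimension variance of a standard Brownian bridge at time t."""
    return (t - t_start) * (t_end - t) / (t_end - t_start)


def bridge_penalty(frames: Sequence[Vector]) -> float:
    """Average variance-scaled squared deviation of interior frames from the
    bridge mean. Illustrative only; not the constraint defined in the paper."""
    n = len(frames)
    if n < 3:
        return 0.0  # no interior frames to constrain
    t_start, t_end = 0, n - 1
    total = 0.0
    for t in range(1, n - 1):
        mean = bridge_mean(frames[0], frames[-1], t, t_start, t_end)
        var = bridge_variance(t, t_start, t_end)
        sq_err = sum((x - m) ** 2 for x, m in zip(frames[t], mean))
        total += sq_err / (2.0 * var)
    return total / (n - 2)
```

Frames lying exactly on the straight line between the window's endpoints incur zero penalty. Deviations near the endpoints are penalized more heavily than deviations in the middle, because the bridge variance is smallest there.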
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Contrastive Learning
