Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents

Zhizhen Zhang; Lei Zhu; Zhen Fang; Zi Huang; Yadan Luo

arXiv:2502.01218 · cs.RO · December 19, 2025

TL;DR

This paper introduces AcTOL (Action Temporal Coherence Learning), a vision-language pretraining method that learns temporally ordered and continuous representations from human action videos without relying on goal-reaching heuristics, with the aim of improving embodied agent performance.

Contribution

The paper proposes AcTOL, which learns ordered and continuous vision-language representations by contrasting semantic differences between frames and imposing a local Brownian bridge constraint, avoiding goal-reaching heuristics.

Findings

01. The pretrained features improve downstream manipulation tasks.

02. The method is robust to diverse linguistic instructions.

03. The method improves generalization in embodied agents.

Abstract

Pre-training vision-language representations on human action videos has emerged as a promising approach to reduce reliance on large-scale expert demonstrations for training embodied agents. However, prior methods often employ time contrastive learning based on goal-reaching heuristics, progressively aligning language instructions from the initial to the final frame. This overemphasis on future frames can result in erroneous vision-language associations, as actions may terminate early or include irrelevant moments at the end. To address this issue, we propose Action Temporal Coherence Learning (AcTOL) to learn ordered and continuous vision-language representations without rigid goal-based constraints. AcTOL treats a video as a continuous trajectory where it (1) contrasts semantic differences between frames to reflect their natural ordering, and (2) imposes a local Brownian bridge…
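The abstract is cut off before it describes the Brownian bridge term, so the following is only a minimal sketch of what a local Brownian bridge constraint on frame embeddings could look like, not the paper's actual loss. It relies on a standard fact: a Brownian bridge pinned at z_a (time a) and z_b (time b) has, at an intermediate time t, a mean on the straight line between the endpoints and a variance of (t - a)(b - t)/(b - a). The function names, the window size, and the use of integer frame indices as timestamps are all assumptions made for illustration.

```python
def bridge_mean_var(z_start, z_end, t_start, t_end, t):
    """Marginal mean and per-dimension variance of a standard Brownian
    bridge pinned at z_start (time t_start) and z_end (time t_end)."""
    alpha = (t - t_start) / (t_end - t_start)
    mean = [(1.0 - alpha) * s + alpha * e for s, e in zip(z_start, z_end)]
    var = (t - t_start) * (t_end - t) / (t_end - t_start)
    return mean, var


def local_bridge_penalty(frames, window=4):
    """Hypothetical regularizer: average variance-scaled squared deviation
    of interior frame embeddings from the bridge mean, over sliding windows.

    frames: list of embeddings (each a list of floats), one per timestep.
    window: assumed gap between the two pinned endpoints of each window.
    """
    n = len(frames)
    if window < 2 or n <= window:
        raise ValueError("need window >= 2 and more frames than the window size")
    total, count = 0.0, 0
    for a in range(n - window):
        b = a + window
        for t in range(a + 1, b):
            mean, var = bridge_mean_var(frames[a], frames[b], a, b, t)
            sq_dev = sum((x - m) ** 2 for x, m in zip(frames[t], mean))
            # Deviations near an endpoint are penalized more, because the
            # bridge variance shrinks to zero at the pinned ends.
            total += sq_dev / (2.0 * var)
            count += 1
    return total / count


# Usage: a smooth (linear) trajectory versus one with an abrupt jump.
straight = [[i / 7.0] * 3 for i in range(8)]
bent = [row[:] for row in straight]
bent[3] = [x + 1.0 for x in bent[3]]
smooth_penalty = local_bridge_penalty(straight)
jump_penalty = local_bridge_penalty(bent)
```

Under these assumptions, a trajectory that moves evenly between its window endpoints incurs essentially no penalty, while an abrupt jump in a single frame is penalized, which is the kind of local continuity the abstract alludes to.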

Code & Models

Repositories

daisy-zzz/actol (PyTorch, official)

Videos

Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents (SlidesLive)

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Contrastive Learning