Prompt-based Visual Alignment for Zero-shot Policy Transfer

Haihan Gao; Rui Zhang; Qi Yi; Hantao Yao; Haochen Li; Jiaming Guo; Shaohui Peng; Yunkai Gao; QiCheng Wang; Xing Hu; Yuanbo Wen; Zihao Zhang; Zidong Du; Ling Li; Qi Guo; Yunji Chen

arXiv:2406.03250·cs.CV·June 6, 2024


Open Access

TL;DR

This paper introduces prompt-based visual alignment (PVA), a framework that uses visual-language models and prompt tuning to improve zero-shot policy transfer in reinforcement learning by aligning images across domains with semantic constraints.

Contribution

The work presents a novel prompt-based visual alignment method that leverages visual-language models and explicit semantic constraints to enhance cross-domain generalization in RL.

Findings

1. PVA achieves strong zero-shot generalization in unseen domains.

2. The framework reduces the need for extensive multi-domain data.

3. Experiments demonstrate improved performance in autonomous driving tasks.

Abstract

Overfitting has become one of the main obstacles to applications of reinforcement learning (RL). Existing methods do not provide explicit semantic constraints for the feature extractor, which hinders the agent from learning a unified cross-domain representation and results in performance degradation on unseen domains. In addition, they require abundant data from multiple domains. To address these issues, we propose prompt-based visual alignment (PVA), a robust framework that mitigates detrimental domain bias in images for zero-shot policy transfer. Inspired by the observation that a Visual-Language Model (VLM) can serve as a bridge between text space and image space, we leverage the semantic information contained in a text sequence as an explicit constraint to train a visual aligner. Thus, the visual aligner can map images from multiple domains to a unified domain and achieve good…
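The idea of using a text sequence as an explicit constraint on image features can be illustrated with a toy example. The sketch below is not the paper's loss: the abstract shown here is truncated and does not specify the objective. It assumes, purely for illustration, that aligned images and a text prompt are embedded in a shared space and that alignment is scored by cosine similarity. The function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_alignment_loss(image_embeddings, text_embedding):
    """Mean of (1 - cosine similarity) between each image embedding
    and a single text embedding.

    Assumption: this is an illustrative stand-in for a text-derived
    semantic constraint, not the objective used in the paper.
    """
    losses = [1.0 - cosine_similarity(e, text_embedding)
              for e in image_embeddings]
    return sum(losses) / len(losses)


# Hypothetical 2-D embeddings: one text prompt and two batches of images.
text = [1.0, 0.0]
aligned = [[2.0, 0.0], [0.5, 0.0]]      # same direction as the text
unaligned = [[0.0, 1.0], [1.0, 1.0]]    # drifted away from the text

loss_aligned = semantic_alignment_loss(aligned, text)
loss_unaligned = semantic_alignment_loss(unaligned, text)
```

Under these assumptions, image embeddings that point in the same direction as the text embedding give a loss near zero, while embeddings that drift away from it give a larger loss, which is the kind of signal a visual aligner could be trained to minimize.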


Taxonomy

Topics: Hong Kong and Taiwan Politics · Multimodal Machine Learning Applications · Human Pose and Action Recognition

Methods: Entropy Regularization · Proximal Policy Optimization · CARLA: An Open Urban Driving Simulator