Semore: VLM-guided Enhanced Semantic Motion Representations for Visual Reinforcement Learning
Wentao Wang, Chunyang Liu, Kehua Sheng, Bo Zhang, Yan Wang

TL;DR
Semore introduces a VLM-guided framework that enhances semantic and motion representations in visual reinforcement learning, leading to more efficient and adaptive decision-making by integrating common-sense knowledge and pre-trained models.
Contribution
The paper proposes a novel VLM-based framework, Semore, that extracts and fuses semantic and motion representations for improved visual RL performance.
Findings
Outperforms state-of-the-art methods in experiments.
Efficiently fuses semantic and motion features.
Demonstrates adaptive decision-making capabilities.
Abstract
The growing exploration of Large Language Models (LLMs) and Vision-Language Models (VLMs) has opened avenues for enhancing the effectiveness of reinforcement learning (RL). However, existing LLM-based RL methods often focus on guiding the control policy and are limited by the representational capacity of their backbone networks. To tackle this problem, we introduce Enhanced Semantic Motion Representations (Semore), a new VLM-based framework for visual RL that simultaneously extracts semantic and motion representations from RGB flows through a dual-path backbone. Semore uses a VLM with common-sense knowledge to retrieve key information from observations and uses pre-trained CLIP for text-image alignment, thereby embedding the ground-truth representations into the backbone. To efficiently fuse semantic and motion representations for decision-making, our…
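The abstract excerpt does not say how the text-image alignment is turned into a training signal. As a rough illustration only, one common way to align visual features with text embeddings is a cosine-similarity loss. The sketch below assumes that form; the function names `cosine_similarity` and `alignment_loss` are hypothetical and are not taken from the paper or from any CLIP library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def alignment_loss(image_features, text_embeddings):
    """Mean of (1 - cosine similarity) over paired image/text vectors.

    Assumption: image_features[i] is the backbone's semantic feature for
    observation i, and text_embeddings[i] is the embedding of the text
    describing that observation. The loss is 0 when every pair points in
    the same direction and grows as the pairs diverge.
    """
    if len(image_features) != len(text_embeddings) or not image_features:
        raise ValueError("need a non-empty batch of paired vectors")
    losses = [
        1.0 - cosine_similarity(img, txt)
        for img, txt in zip(image_features, text_embeddings)
    ]
    return sum(losses) / len(losses)
```

In a real pipeline the vectors would come from a learned image encoder and a frozen text encoder, and the loss would be computed with a tensor library so that gradients can flow into the backbone; plain lists are used here only to keep the sketch self-contained.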
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation
