Semore: VLM-guided Enhanced Semantic Motion Representations for Visual Reinforcement Learning

Wentao Wang; Chunyang Liu; Kehua Sheng; Bo Zhang; Yan Wang

arXiv:2512.05172·cs.CV·December 8, 2025

TL;DR

Semore introduces a VLM-guided framework that enhances semantic and motion representations in visual reinforcement learning, leading to more efficient and adaptive decision-making by integrating common-sense knowledge and pre-trained models.

Contribution

The paper proposes a novel VLM-based framework, Semore, that extracts and fuses semantic and motion representations for improved visual RL performance.

Findings

01. Outperforms state-of-the-art methods in experiments.

02. Efficiently fuses semantic and motion features.

03. Demonstrates adaptive decision-making capabilities.

Abstract

The growing exploration of Large Language Models (LLMs) and Vision-Language Models (VLMs) has opened avenues for enhancing the effectiveness of reinforcement learning (RL). However, existing LLM-based RL methods often focus on guiding the control policy and are limited by the representational capacity of their backbone networks. To tackle this problem, we introduce Enhanced Semantic Motion Representations (Semore), a new VLM-based framework for visual RL, which simultaneously extracts semantic and motion representations from the RGB flows through a dual-path backbone. Semore uses a VLM with common-sense knowledge to retrieve key information from observations, and uses pre-trained CLIP to achieve text-image alignment, thereby embedding the ground-truth representations into the backbone. To efficiently fuse semantic and motion representations for decision-making, our…
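The abstract's two ideas, a separate motion path and text-image alignment, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the paper's implementation: the function names are hypothetical, frame differencing stands in for whatever motion input Semore actually uses, and the cosine-based loss is a generic CLIP-style alignment objective.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Treat a zero vector as having no similarity to anything.
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_loss(image_feature, text_feature):
    """Hypothetical alignment loss: 0 when the features point the same way.

    Assumption: image_feature comes from a semantic path and text_feature
    from a frozen text encoder; neither encoder is modeled here.
    """
    return 1.0 - cosine_similarity(image_feature, text_feature)


def motion_input(frames):
    """Hypothetical motion-path input: differences of consecutive frames.

    Assumption: each frame is a flat list of pixel values of equal length.
    """
    return [
        [curr - prev for prev, curr in zip(prev_frame, curr_frame)]
        for prev_frame, curr_frame in zip(frames, frames[1:])
    ]
```

Under these assumptions, minimizing `alignment_loss` pulls the semantic feature toward the text embedding of the VLM's description, while `motion_input` gives the second path a signal that depends only on change between frames.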

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation