ReMoT: Reinforcement Learning with Motion Contrast Triplets

Cong Wan; Zeyu Guo; Jiangyang Li; SongLin Dong; Yifan Bai; Lin Peng; Zhiheng Ma; Yihong Gong

arXiv:2603.00461·cs.CV·March 23, 2026

Open Access

TL;DR

ReMoT is a training paradigm that combines a rule-generated motion-contrast dataset (ReMoT-16K, 16.5K triplets), Group Relative Policy Optimization, and a new fine-grained benchmark to improve spatio-temporal reasoning in vision-language models.

Contribution

The paper presents ReMoT, a unified training paradigm that combines ReMoT-16K, a large-scale motion-contrast dataset, with Group Relative Policy Optimization (GRPO) to strengthen spatio-temporal reasoning in VLMs.

Findings

01. Achieved state-of-the-art results on new and existing benchmarks.

02. Constructed the first fine-grained motion contrast triplet benchmark.

03. Realized a 25.1% performance improvement on spatio-temporal tasks.

Abstract

We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatio-temporal consistency -- a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) A rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based generation. (2) Group Relative Policy Optimization, which we empirically validate yields optimal performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning. We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM's discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on…
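
The abstract names Group Relative Policy Optimization but does not spell out the mechanism. As a rough illustration of the general idea (not the authors' implementation), GRPO samples several responses to the same prompt and scores each one relative to the rest of its group, rather than against a learned value function. The sketch below shows only that normalization step. The function name, the binary reward scheme, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and spread.

    `rewards` holds one scalar per sampled response to the same prompt.
    `eps` guards against division by zero when every reward is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, some variants differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one motion-contrast question:
# 1.0 means the model picked the correct motion description, 0.0 means it did not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group in which every answer earns the same reward carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Correct answers receive positive advantages and incorrect ones negative advantages, while a group with uniform rewards yields all zeros and therefore contributes nothing to the policy update.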

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Generative Adversarial Networks and Image Synthesis