VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models

Haidong Xu; Guangwei Xu; Zhedong Zheng; Xiatian Zhu; Wei Ji; Xiangtai Li; Ruijie Guo; Meishan Zhang; Min Zhang; Hao Fei

arXiv:2508.12081·cs.CV·October 21, 2025

TL;DR

VimoRAG is a video-based retrieval-augmented framework that improves 3D motion generation in motion LLMs by retrieving relevant 2D human motion signals from large-scale in-the-wild video databases, mitigating the out-of-domain and out-of-vocabulary issues caused by limited annotated data.

Contribution

The paper presents two mechanisms: the Gemini Motion Video Retriever, for motion-centered video retrieval, and the Motion-centric Dual-alignment DPO Trainer, for mitigating error propagation from suboptimal retrieval results.

Findings

01

Significant performance boost in motion LLMs with VimoRAG.

02

Effective retrieval of human motion signals from videos.

03

Mitigation of error propagation in retrieval-based motion generation.

Abstract

This paper introduces VimoRAG, a novel video-based retrieval-augmented motion generation framework for motion large language models (LLMs). As motion LLMs face severe out-of-domain/out-of-vocabulary issues due to limited annotated data, VimoRAG leverages large-scale in-the-wild video databases to enhance 3D motion generation by retrieving relevant 2D human motion signals. While video-based motion RAG is nontrivial, we address two key bottlenecks: (1) developing an effective motion-centered video retrieval model that distinguishes human poses and actions, and (2) mitigating the issue of error propagation caused by suboptimal retrieval results. We design the Gemini Motion Video Retriever mechanism and the Motion-centric Dual-alignment DPO Trainer, enabling effective retrieval and generation processes. Experimental results show that VimoRAG significantly boosts the performance of motion…
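The trainer named in the abstract builds on DPO (Direct Preference Optimization). As background only, the following is a minimal sketch of the standard single-pair DPO loss. It is not the paper's Motion-centric Dual-alignment variant, whose details are not given here. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) preference pair.

    Inputs are sequence log-probabilities under the trainable policy and a
    frozen reference model. This is the generic objective, NOT the paper's
    dual-alignment trainer; all names here are hypothetical.
    """
    # Implicit rewards: how much the policy up-weights each output
    # relative to the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)

    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy assigns relatively more probability to the preferred output than the reference does.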

Code & Models

Datasets

Haidong2/VimoRAG
dataset · 18 dl

Videos

VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models· slideslive