ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding

Daichi Yashima; Shuhei Kurita; Yusuke Oda; Komei Sugiura

arXiv:2602.16412 · cs.CV · February 24, 2026

Open Access

TL;DR

ReMoRa is a novel multimodal large language model that efficiently processes long videos by operating on compressed motion representations, enabling improved understanding while reducing computational complexity.

Contribution

The paper introduces ReMoRa, a video MLLM that pairs a sparse set of RGB keyframes with denoised motion representations taken from the compressed video, so that long videos can be processed efficiently.

Findings

01 ReMoRa outperforms baseline methods on multiple long-video benchmarks.

02 The model scales linearly with sequence length, improving efficiency.

03 Effective encoding of temporal dynamics as motion representations enhances understanding.
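
To make the scaling finding concrete, the sketch below compares how a linear cost and a quadratic cost (the cost of full self-attention over all token pairs) grow when a sequence becomes longer. The function names and the unit-cost model are illustrative assumptions, not measurements or code from the paper.

```python
from typing import Callable


def quadratic_cost(num_tokens: int) -> int:
    # Full self-attention scores every (query, key) pair: n * n pairs.
    return num_tokens * num_tokens


def linear_cost(num_tokens: int) -> int:
    # Assumed unit-cost model: each token is processed a constant number of times.
    return num_tokens


def growth_factor(cost_fn: Callable[[int], int], base_tokens: int, multiplier: int) -> float:
    # Ratio of the cost at `multiplier` times the base length to the cost at the base length.
    return cost_fn(base_tokens * multiplier) / cost_fn(base_tokens)


if __name__ == "__main__":
    for multiplier in (2, 10):
        print(
            multiplier,
            growth_factor(linear_cost, 1000, multiplier),
            growth_factor(quadratic_cost, 1000, multiplier),
        )
```

Under this simplified model, a video that is 10 times longer costs 10 times more with linear scaling and 100 times more with quadratic scaling.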

Abstract

While multimodal large language models (MLLMs) have shown remarkable success across a wide range of tasks, long-form video understanding remains a significant challenge. In this study, we focus on video understanding by MLLMs. This task is challenging because processing a full stream of RGB frames is computationally intractable and highly redundant, as self-attention has quadratic complexity in sequence length. In this paper, we propose ReMoRa, a video MLLM that processes videos by operating directly on their compressed representations. A sparse set of RGB keyframes is retained for appearance, while temporal dynamics are encoded as a motion representation, removing the need for sequential RGB frames. These motion representations act as a compact proxy for optical flow, capturing temporal dynamics without full frame decoding. To refine the noise and low fidelity of block-based…
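
For readers unfamiliar with block-based motion, the sketch below shows generic exhaustive block matching, the kind of search that produces one displacement vector per block. It is an illustration only and not ReMoRa's pipeline: per the abstract, ReMoRa takes motion from the compressed representation rather than computing it. All names (`Frame`, `make_frame`, `block_sad`, `motion_vector`), the sum-of-absolute-differences criterion, and the tiny frame sizes are assumptions made for this example.

```python
from typing import List, Tuple

Frame = List[List[int]]  # grayscale frame stored as rows of pixel intensities


def make_frame(height: int, width: int, top: int, left: int, patch: Frame) -> Frame:
    # Build a black frame and paste `patch` with its top-left corner at (top, left).
    frame = [[0] * width for _ in range(height)]
    for r, row in enumerate(patch):
        for c, value in enumerate(row):
            frame[top + r][left + c] = value
    return frame


def block_sad(prev: Frame, curr: Frame, top: int, left: int,
              dy: int, dx: int, size: int) -> int:
    # Sum of absolute differences between the block at (top, left) in `curr`
    # and the block displaced by (dy, dx) in `prev`.
    total = 0
    for r in range(size):
        for c in range(size):
            total += abs(curr[top + r][left + c] - prev[top + dy + r][left + dx + c])
    return total


def motion_vector(prev: Frame, curr: Frame, top: int, left: int,
                  size: int = 2, radius: int = 2) -> Tuple[int, int]:
    # Exhaustively search displacements within `radius` and return the (dy, dx)
    # pointing from the current block to its best match in the previous frame.
    height, width = len(prev), len(prev[0])
    best = (0, 0)
    best_cost = block_sad(prev, curr, top, left, 0, 0, size)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            inside = (0 <= top + dy and top + dy + size <= height
                      and 0 <= left + dx and left + dx + size <= width)
            if not inside:
                continue  # skip candidates that fall outside the frame
            cost = block_sad(prev, curr, top, left, dy, dx, size)
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best


if __name__ == "__main__":
    patch = [[10, 20], [30, 40]]
    prev = make_frame(6, 6, 1, 1, patch)  # patch at row 1, column 1
    curr = make_frame(6, 6, 2, 3, patch)  # same patch moved down 1, right 2
    print(motion_vector(prev, curr, top=2, left=3))  # (-1, -2)
```

One vector per block is far more compact than a per-pixel optical flow field, but it is also coarser, which is why such vectors are only a rough proxy for flow.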

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition