VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding

Yufei Yin; Qianke Meng; Minghao Chen; Jiajun Ding; Zhenwei Shao; Zhou Yu

arXiv:2512.12360 · cs.CV · March 31, 2026

TL;DR

VideoARM introduces an agentic, hierarchical memory framework for long-form video understanding, enabling adaptive reasoning and reducing token consumption compared to existing methods.

Contribution

The paper proposes an agentic reasoning paradigm with hierarchical memory that interprets videos adaptively, improving efficiency and performance over prior static approaches.

Findings

01. Outperforms the state-of-the-art DVD method on benchmarks.

02. Significantly reduces token consumption during video processing.

03. Demonstrates effective adaptive reasoning for long-form video understanding.

Abstract

Long-form video understanding remains challenging due to the extended temporal structure and dense multimodal cues. Despite recent progress, many existing approaches still rely on hand-crafted reasoning pipelines or employ token-consuming video preprocessing to guide MLLMs in autonomous reasoning. To overcome these limitations, we introduce VideoARM, an Agentic Reasoning-over-hierarchical-Memory paradigm for long-form video understanding. Instead of static, exhaustive preprocessing, VideoARM performs adaptive, on-the-fly agentic reasoning and memory construction. Specifically, VideoARM performs an adaptive and continuous loop of observing, thinking, acting, and memorizing, where a controller autonomously invokes tools to interpret the video in a coarse-to-fine manner, thereby substantially reducing token consumption. In parallel, a hierarchical multimodal memory continuously captures…
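The observe, think, act, memorize loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the "video" is a list of text captions, the controller is a simple word-overlap rule standing in for an MLLM, and the names `HierarchicalMemory`, `coarse_scan`, `fine_scan`, and `run_agent` are all hypothetical. The two memory tiers are likewise an assumption made for illustration. The sketch only shows how scanning coarsely first and zooming in on a promising segment can read fewer frames than exhaustive preprocessing.

```python
from dataclasses import dataclass, field


@dataclass
class HierarchicalMemory:
    # Two illustrative tiers (an assumption; the paper's tiers are not shown here).
    coarse: dict = field(default_factory=dict)  # segment id -> caption of one sampled frame
    fine: dict = field(default_factory=dict)    # frame index -> caption


def coarse_scan(video, segment_len):
    """Hypothetical tool: read one frame per segment (cheap, low detail)."""
    return {i // segment_len: video[i] for i in range(0, len(video), segment_len)}


def fine_scan(video, segment, segment_len):
    """Hypothetical tool: read every frame inside one segment (costly, detailed)."""
    start = segment * segment_len
    stop = min(start + segment_len, len(video))
    return {i: video[i] for i in range(start, stop)}


def find_answer(memory, query_words):
    """Return the first remembered frame whose caption contains all query words."""
    for idx in sorted(memory.fine):
        if query_words <= set(memory.fine[idx].split()):
            return idx
    return None


def run_agent(video, query_words, segment_len=4, max_steps=10):
    memory = HierarchicalMemory()
    visited = set()
    frames_read = 0  # stand-in for token consumption
    for _ in range(max_steps):
        # Observe + think: stop as soon as memory already answers the query.
        answer = find_answer(memory, query_words)
        if answer is not None:
            return answer, frames_read
        if not memory.coarse:
            # Act: start with a cheap pass over the whole video.
            observation = coarse_scan(video, segment_len)
            memory.coarse.update(observation)  # memorize
        else:
            candidates = [s for s in memory.coarse if s not in visited]
            if not candidates:
                break
            # Act: zoom into the unvisited segment whose coarse caption best matches.
            best = max(candidates,
                       key=lambda s: len(query_words & set(memory.coarse[s].split())))
            visited.add(best)
            observation = fine_scan(video, best, segment_len)
            memory.fine.update(observation)  # memorize
        frames_read += len(observation)
    return find_answer(memory, query_words), frames_read


video = [
    "street car", "street car", "street bus", "street car",
    "kitchen table", "kitchen table", "kitchen knife", "kitchen table",
    "park tree", "park dog", "park tree", "park tree",
]
print(run_agent(video, {"kitchen", "knife"}))  # (6, 7)
```

In this toy run the agent reads 7 frames (3 coarse, 4 fine) instead of all 12, because the coarse caption of the middle segment already mentions "kitchen". A real controller would make this choice with a multimodal model and richer tools, so the numbers here carry no meaning beyond the illustration.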
