VideoAtlas: Navigating Long-Form Video in Logarithmic Compute
Mohamed Eltahir, Ali Habibullah, Yazan Alshoibi, Lama Ayash, Tanveer Hussain, Naeemullah Khan

TL;DR
VideoAtlas is a hierarchical, lossless environment for navigating long-form video. Paired with recursive language models, it keeps compute growth logarithmic in video duration and supports robust understanding across extended durations.
Contribution
The paper presents VideoAtlas, a novel environment that structures video as a hierarchical grid, facilitating lossless, scalable navigation and enabling the extension of recursive language models to video understanding.
Findings
Logarithmic compute growth with video duration.
30-60% multimodal cache hit rate from grid reuse.
Robust performance on 1-hour to 10-hour video benchmarks.
Abstract
Extending language models to video introduces two challenges: representation, where existing methods rely on lossy approximations, and long-context, where caption- or agent-based pipelines collapse video into text and lose visual fidelity. To overcome this, we introduce VideoAtlas, a task-agnostic environment to represent video as a hierarchical grid that is simultaneously lossless, navigable, scalable, caption- and preprocessing-free. An overview of the video is available at a glance, and any region can be recursively zoomed into, with the same visual representation used uniformly for the video, intermediate investigations, and the agent's memory, eliminating lossy text conversion end-to-end. This hierarchical structure ensures access depth grows only logarithmically with video length. For long-context, Recursive Language Models (RLMs) recently offered a powerful solution for…
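The claim that access depth grows logarithmically with video length can be illustrated with a small sketch. It is not the paper's implementation: it assumes a video of num_frames sampled frames, a grid of cells_per_grid cells at every level, and equal temporal splitting of each cell, none of which are specified in the abstract. The function names zoom_depth and zoom_path are hypothetical.

```python
import math


def zoom_depth(num_frames: int, cells_per_grid: int) -> int:
    """Worst-case number of grids to view before a cell holds one frame.

    Assumption (not from the paper): every grid splits its time span
    into `cells_per_grid` equal parts.
    """
    if num_frames < 1 or cells_per_grid < 2:
        raise ValueError("need num_frames >= 1 and cells_per_grid >= 2")
    depth = 0
    span = num_frames  # frames covered by the region currently in view
    while span > 1:
        span = math.ceil(span / cells_per_grid)  # frames per cell
        depth += 1
    return depth


def zoom_path(target: int, num_frames: int, cells_per_grid: int) -> list[int]:
    """Cell index chosen at each level when zooming to frame `target`."""
    if not 0 <= target < num_frames:
        raise ValueError("target must be a valid frame index")
    start, end = 0, num_frames  # half-open frame range in view
    path = []
    while end - start > 1:
        cell_span = math.ceil((end - start) / cells_per_grid)
        cell = (target - start) // cell_span
        path.append(cell)
        start = start + cell * cell_span
        end = min(start + cell_span, end)
    return path


if __name__ == "__main__":
    # Hypothetical setting: 1 frame per second and an 8x8 grid (64 cells).
    for hours in (1, 10):
        frames = hours * 3600
        print(hours, "h ->", zoom_depth(frames, 64), "levels")
```

Under these assumed settings, a 1-hour video needs 2 levels and a 10-hour video needs 3, so a tenfold increase in duration adds a single zoom step. The actual grid size and sampling rate used by VideoAtlas may differ.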
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)
