TL;DR
SILVR is a modular, training-free framework that transforms videos into language-based representations and uses large language models for complex video reasoning tasks, achieving state-of-the-art results.
Contribution
We introduce SILVR, a simple, two-stage, language-based video reasoning framework that effectively leverages LLMs without additional training.
Findings
SILVR achieves top results on multiple video reasoning benchmarks.
It effectively aggregates multisensory video, speech, and audio inputs.
Strong reasoning LLMs can handle complex temporal and causal video tasks.
Abstract
Recent advances in test-time optimization have led to remarkable reasoning capabilities in Large Language Models (LLMs), enabling them to solve highly complex problems in math and coding. However, the reasoning capabilities of multimodal LLMs (MLLMs) still significantly lag, especially for complex video-language tasks. To address this issue, we present SILVR, a Simple Language-based Video Reasoning framework that decomposes complex video understanding into two stages. In the first stage, SILVR transforms raw video into language-based representations using multisensory inputs, such as short clip captions and audio/speech subtitles. In the second stage, language descriptions are fed into a powerful reasoning LLM to solve complex video-language understanding tasks. To handle long-context multisensory inputs, we use an Adaptive Context Reduction scheme, which dynamically determines the…
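The two-stage decomposition described in the abstract can be illustrated with a minimal sketch. Everything named here (`Segment`, `build_transcript`, `answer_question`, the prompt wording, and the `llm` callable) is hypothetical and not taken from the SILVR codebase. The sketch assumes clip captions and speech subtitles have already been produced by some captioner and speech recognizer, and it leaves out the Adaptive Context Reduction scheme, which the truncated abstract does not describe in full.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """One timestamped piece of language describing the video (hypothetical type)."""
    start: float  # start time in seconds
    end: float    # end time in seconds
    source: str   # e.g. "caption" (visual clip caption) or "speech" (subtitle)
    text: str


def build_transcript(captions: List[Segment], subtitles: List[Segment]) -> str:
    """Stage 1 (sketch): merge captions and subtitles into one time-ordered text."""
    merged = sorted(captions + subtitles, key=lambda s: (s.start, s.source))
    return "\n".join(
        f"[{s.start:.0f}s-{s.end:.0f}s] {s.source}: {s.text}" for s in merged
    )


def answer_question(
    captions: List[Segment],
    subtitles: List[Segment],
    question: str,
    llm: Callable[[str], str],
) -> str:
    """Stage 2 (sketch): hand the language representation to a reasoning LLM.

    `llm` is any function mapping a prompt string to a response string;
    which model sits behind it is an assumption left to the caller.
    """
    transcript = build_transcript(captions, subtitles)
    prompt = (
        "Video description:\n"
        f"{transcript}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return llm(prompt)
```

Because the video is reduced to text before any reasoning happens, the LLM behind `llm` can be swapped without touching the first stage, which is the sense in which the framework is modular and training-free.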
