SiLVR: A Simple Language-based Video Reasoning Framework

Ce Zhang; Yan-Bo Lin; Ziyang Wang; Mohit Bansal; Gedas Bertasius

arXiv:2505.24869 · cs.CV · April 16, 2026

TL;DR

SiLVR is a modular, training-free framework that converts videos into language-based representations and passes them to large language models for complex video reasoning, achieving state-of-the-art results on multiple video reasoning benchmarks.

Contribution

We introduce SiLVR, a simple, two-stage, language-based video reasoning framework that leverages LLMs without additional training.

Findings

01. Achieves top results on multiple video reasoning benchmarks.

02. Effectively aggregates multisensory video, speech, and audio inputs.

03. Strong reasoning LLMs can handle complex temporal and causal video tasks.

Abstract

Recent advances in test-time optimization have led to remarkable reasoning capabilities in Large Language Models (LLMs), enabling them to solve highly complex problems in math and coding. However, the reasoning capabilities of multimodal LLMs (MLLMs) still significantly lag, especially for complex video-language tasks. To address this issue, we present SiLVR, a Simple Language-based Video Reasoning framework that decomposes complex video understanding into two stages. In the first stage, SiLVR transforms raw video into language-based representations using multisensory inputs, such as short clip captions and audio/speech subtitles. In the second stage, language descriptions are fed into a powerful reasoning LLM to solve complex video-language understanding tasks. To handle long-context multisensory inputs, we use an Adaptive Context Reduction scheme, which dynamically determines the…
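The two-stage split described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Segment` type, the `build_prompt` and `answer` helpers, the `"caption"`/`"speech"` labels, and the prompt layout are all hypothetical names chosen for illustration. The sketch assumes clip captions and speech subtitles have already been produced by some captioner and speech recognizer, and it leaves out the Adaptive Context Reduction scheme, which the truncated abstract does not describe in full.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    start: float  # seconds from the start of the video
    source: str   # "caption" or "speech" (hypothetical labels)
    text: str


def build_prompt(captions: List[Segment], subtitles: List[Segment], question: str) -> str:
    """Stage 1 output: merge both text streams into one time-ordered description."""
    timeline = sorted(captions + subtitles, key=lambda s: s.start)
    lines = [f"[{s.start:.0f}s] ({s.source}) {s.text}" for s in timeline]
    return "Video description:\n" + "\n".join(lines) + f"\n\nQuestion: {question}\nAnswer:"


def answer(captions: List[Segment], subtitles: List[Segment], question: str,
           llm: Callable[[str], str]) -> str:
    """Stage 2: hand the language-only description to a reasoning LLM.

    `llm` is any callable mapping a prompt string to a response string;
    which model sits behind it is left open in this sketch.
    """
    return llm(build_prompt(captions, subtitles, question))
```

Because the second stage only ever sees text, the reasoning model can be swapped without touching the first stage, which is one plausible reading of the "modular, training-free" description above.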
