SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis

Junho Kim; Hyunjun Kim; Hosu Lee; Yong Man Ro

arXiv:2411.16173 · cs.CV · March 24, 2025

TL;DR

SALOVA is a novel framework that improves understanding and retrieval of long-form videos by segmenting content, enabling targeted responses, and maintaining context over extended sequences, addressing current limitations of large multi-modal models.

Contribution

The paper introduces SALOVA, a new video-LLM framework with a novel dataset and architectural innovations for better long video comprehension and targeted retrieval.

Findings

01. Enhanced retrieval accuracy in long videos
02. Improved contextual relevance in responses
03. Effective processing of complex long-form content

Abstract

Despite advances in Large Multi-modal Models, applying them to long and untrimmed video content remains challenging due to limitations in context length and substantial memory overhead. These constraints often lead to significant information loss and reduced relevance in the model responses. With the exponential growth of video data across web platforms, understanding long-form video is crucial for advancing generalized intelligence. In this paper, we introduce SALOVA: Segment-Augmented LOng Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content through a targeted retrieval process. We address two main challenges to achieve this: (i) We present the SceneWalk dataset, a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich descriptive…

Code & Models

Datasets

IVLLab/SceneWalk
dataset · 9 dl

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition