ViLLa: Video Reasoning Segmentation with Large Language Model

Rongkun Zheng; Lu Qi; Xi Chen; Yi Wang; Kun Wang; Yu Qiao; Hengshuang Zhao

arXiv:2407.14500 · cs.CV · March 18, 2025 · 2 cites

TL;DR

ViLLa introduces a novel approach combining dynamic context encoding, hierarchical temporal modeling, and adaptive segmentation to significantly improve video reasoning segmentation in complex, real-world scenarios using large language models.

Contribution

The paper presents ViLLa, a framework for video reasoning segmentation in complex scenarios, and introduces a new benchmark dataset.

Findings

01. Achieves state-of-the-art results on multiple benchmarks.

02. Effectively handles long videos with complex object interactions.

03. Demonstrates improved reasoning accuracy in real-world scenes.

Abstract

Recent efforts in video reasoning segmentation (VRS) integrate large language models (LLMs) with perception models to localize and track objects via textual instructions, achieving barely satisfactory results in simple scenarios. However, they struggle to discriminate and deduce the objects from user queries in more realistic scenes characterized by long durations, multiple objects, rapid motion, and heavy occlusions. In this work, we analyze the underlying causes of these limitations and present ViLLa: Video reasoning segmentation with Large Language Model. ViLLa tackles these challenges through multiple core innovations: (1) a context synthesizer that dynamically encodes the user intent with video contexts for accurate reasoning, resolving ambiguities in complex queries, and (2) a hierarchical temporal synchronizer that disentangles multi-object interactions…

Code & Models

Repositories

rkzheng99/villa (official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling