ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation

Ali Athar; Xueqing Deng; Liang-Chieh Chen

arXiv:2412.09754·cs.CV·April 4, 2025

Open Access · 2 Models · 1 Dataset

TL;DR

ViCaS introduces a comprehensive dataset combining detailed video captions with pixel-level object segmentation, enabling unified evaluation of high-level understanding and precise localization in videos.

Contribution

The paper presents ViCaS, a novel dataset that unifies high-level video captioning and pixel-precise segmentation with grounded language annotations, along with a new benchmark and model architecture.

Findings

01 The ViCaS dataset contains thousands of videos, each with detailed captions and segmentation masks.

02 The proposed model architecture handles both holistic understanding and pixel-level segmentation.

03 The evaluation measures demonstrate the dataset's utility for advancing multimodal video understanding.

Abstract

Recent advances in multimodal large language models (MLLMs) have expanded research in video understanding, primarily focusing on high-level tasks such as video captioning and question-answering. Meanwhile, a smaller body of work addresses dense, pixel-precise segmentation tasks, which typically involve category-guided or referral-based object segmentation. Although both directions are essential for developing models with human-level video comprehension, they have largely evolved separately, with distinct benchmarks and architectures. This paper aims to unify these efforts by introducing ViCaS, a new dataset containing thousands of challenging videos, each annotated with detailed, human-written captions and temporally consistent, pixel-accurate masks for multiple objects with phrase grounding. Our benchmark evaluates models on both holistic/high-level understanding and language-guided,…
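
To make the annotation described in the abstract more concrete (a caption whose phrases are grounded to per-object mask tracks), the following is a minimal Python sketch. The class names, field names, and the pixel-set mask encoding are all assumptions chosen for illustration; they are not the ViCaS file format or the authors' code.

```python
from dataclasses import dataclass, field

# Hypothetical mask encoding: a set of (row, col) foreground pixels.
Pixel = tuple[int, int]


@dataclass
class ObjectTrack:
    """Masks for one object across frames (illustrative layout only)."""
    track_id: int
    # frame index -> foreground pixels; a missing frame means "not visible"
    masks: dict[int, frozenset[Pixel]] = field(default_factory=dict)


@dataclass
class GroundedPhrase:
    """A caption span linked to one or more object tracks."""
    start: int            # inclusive character offset into the caption
    end: int              # exclusive character offset into the caption
    track_ids: list[int]  # tracks this phrase refers to


@dataclass
class VideoAnnotation:
    caption: str
    phrases: list[GroundedPhrase]
    tracks: dict[int, ObjectTrack]

    def phrase_text(self, phrase: GroundedPhrase) -> str:
        """Return the caption substring that a phrase covers."""
        return self.caption[phrase.start:phrase.end]

    def validate(self) -> list[str]:
        """Collect consistency problems; an empty list means none were found."""
        problems = []
        for p in self.phrases:
            if not (0 <= p.start < p.end <= len(self.caption)):
                problems.append(f"span [{p.start}, {p.end}) is outside the caption")
            for tid in p.track_ids:
                if tid not in self.tracks:
                    problems.append(f"phrase refers to unknown track {tid}")
        return problems


# Toy example with made-up content.
caption = "A dog chases a ball across the yard."
ball_start = caption.index("a ball")
ann = VideoAnnotation(
    caption=caption,
    phrases=[
        GroundedPhrase(0, len("A dog"), [1]),
        GroundedPhrase(ball_start, ball_start + len("a ball"), [2]),
    ],
    tracks={
        1: ObjectTrack(1, {0: frozenset({(10, 12), (10, 13)})}),
        2: ObjectTrack(2, {0: frozenset({(30, 40)})}),
    },
)

for phrase in ann.phrases:
    print(ann.phrase_text(phrase), "->", phrase.track_ids)
```

Keeping phrase spans as character offsets rather than copied strings means the grounding stays tied to one caption text, and `validate` can catch spans or track references that drift out of sync.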

Code & Models

Models

Datasets

Ali2500/ViCaS
dataset · 145 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization