SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models
Ye Sun, Hao Zhang, Henghui Ding, Tiehua Zhang, Xingjun Ma, Yu-Gang Jiang

TL;DR
This paper introduces SAMA, a framework comprising a large-scale dataset, a model, and a benchmark for multi-turn referential grounded video chat, targeting fine-grained video understanding and grounded dialogue.
Contribution
The paper presents the SAMA-239K dataset, the SAMA model with spatio-temporal aggregation and grounding, and the SAMA-Bench benchmark, which together enable unified multi-turn referential video understanding.
Findings
SAMA achieves state-of-the-art results on grounding benchmarks.
SAMA performs strongly on multi-turn video chat tasks.
The dataset and benchmark facilitate comprehensive evaluation of video LMMs.
Abstract
Achieving fine-grained spatio-temporal understanding in videos remains a major challenge for current Video Large Multimodal Models (Video LMMs). Addressing this challenge requires mastering two core capabilities: video referring understanding, which captures the semantics of video regions, and video grounding, which segments object regions based on natural language descriptions. However, most existing approaches tackle these tasks in isolation, limiting progress toward unified, referentially grounded video interaction. We identify a key bottleneck in the lack of high-quality, unified video instruction data and a comprehensive benchmark for evaluating referentially grounded video chat. To address these challenges, we contribute in three core aspects: dataset, model, and benchmark. First, we introduce SAMA-239K, a large-scale dataset comprising 15K videos specifically curated to enable…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
