SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models

Ye Sun; Hao Zhang; Henghui Ding; Tiehua Zhang; Xingjun Ma; Yu-Gang Jiang

arXiv:2505.18812·cs.CV·October 27, 2025

Open Access

TL;DR

This paper introduces SAMA, a framework for multi-turn referential grounded video chat that comprises a large-scale dataset (SAMA-239K), a model, and a benchmark (SAMA-Bench), targeting fine-grained video understanding and grounded dialogue.

Contribution

The paper presents the SAMA-239K dataset, the SAMA model with spatio-temporal aggregation and grounding, and the SAMA-Bench benchmark, which together enable unified multi-turn video referential understanding.

Findings

01. SAMA achieves state-of-the-art results on grounding benchmarks.

02. SAMA performs strongly on multi-turn video chat tasks.

03. The dataset and benchmark facilitate comprehensive evaluation of video LMMs.

Abstract

Achieving fine-grained spatio-temporal understanding in videos remains a major challenge for current Video Large Multimodal Models (Video LMMs). Addressing this challenge requires mastering two core capabilities: video referring understanding, which captures the semantics of video regions, and video grounding, which segments object regions based on natural language descriptions. However, most existing approaches tackle these tasks in isolation, limiting progress toward unified, referentially grounded video interaction. We identify a key bottleneck in the lack of high-quality, unified video instruction data and a comprehensive benchmark for evaluating referentially grounded video chat. To address these challenges, we contribute in three core aspects: dataset, model, and benchmark. First, we introduce SAMA-239K, a large-scale dataset comprising 15K videos specifically curated to enable…
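To make the two capabilities named in the abstract concrete, here is a minimal sketch of how one turn of a referentially grounded video chat might be represented. It is an illustration only, not the SAMA-239K format or the SAMA model's interface: the `Turn` class, its field names, the `turn_kind` helper, and the nested-list mask representation are all hypothetical. Referring is modeled as a region mask supplied with the question, and grounding as per-frame masks returned with the answer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical mask representation: one binary mask per frame, as rows of 0/1.
Mask = List[List[int]]


@dataclass
class Turn:
    """One dialogue turn (illustrative structure, not the paper's schema)."""
    question: str
    # Referring input: frame index -> mask of the region the user points at.
    referred_region: Optional[Dict[int, Mask]] = None
    answer: str = ""
    # Grounding output: phrase in the answer -> {frame index -> mask}.
    grounded_masks: Dict[str, Dict[int, Mask]] = field(default_factory=dict)


def turn_kind(turn: Turn) -> str:
    """Classify a turn by which of the two capabilities it exercises."""
    has_ref = turn.referred_region is not None
    has_ground = bool(turn.grounded_masks)
    if has_ref and has_ground:
        return "referential grounded"
    if has_ref:
        return "referring"
    if has_ground:
        return "grounding"
    return "plain chat"


# Usage: a two-turn dialogue mixing a referring turn and a grounding turn.
mask = [[0, 1], [1, 1]]
dialogue = [
    Turn("What is this object doing?", referred_region={0: mask},
         answer="It is running."),
    Turn("Segment the dog.", answer="Here is the dog.",
         grounded_masks={"the dog": {0: mask, 1: mask}}),
]
kinds = [turn_kind(t) for t in dialogue]
```

A unified system in the sense of the abstract would need to handle turns where both fields are populated within the same conversation, which is the case the sketch labels "referential grounded".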

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems