Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding
Tencent Hunyuan Team

TL;DR
This paper introduces Multi-Stream Scene Script (MTSS), a multimodal video captioning paradigm that factorizes a video's description into separate streams and explicitly grounds those streams to one another. The authors report improved understanding, reasoning, and generation quality across various models.
Contribution
MTSS replaces monolithic video descriptions with factorized, grounded streams, enhancing scalability, fidelity, and performance in multimodal video understanding and generation.
Findings
Reduces total error rate by 25% on Video-SALMONN-2
Achieves a 67% performance gain on the Daily-Omni benchmark
Improves cross-shot identity, audio-visual alignment, and temporal controllability in video generation
Abstract
Advances in Multimodal Large Language Models (MLLMs) are transforming video captioning from a descriptive endpoint into a semantic interface for both video understanding and generation. However, the dominant paradigm still casts videos as monolithic narrative paragraphs that entangle visual, auditory, and identity information. This dense coupling not only compromises representational fidelity but also limits scalability, since even local edits can trigger global rewrites. To address this structural bottleneck, we propose Multi-Stream Scene Script (MTSS), a novel paradigm that replaces monolithic text with factorized and explicitly grounded scene descriptions. MTSS is built on two core principles: Stream Factorization, which decouples a video into complementary streams (Reference, Shot, Event, and Global), and Relational Grounding, which reconnects these isolated streams through explicit…
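To make the idea of factorized, cross-linked streams concrete, here is a minimal Python sketch. It is not the paper's schema: the abstract is cut off before it describes how Relational Grounding works, so the ID-based links, the class names, and every field below are hypothetical choices made only for illustration. The sketch shows one property the abstract does state, namely that a local edit should not force a global rewrite.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# All names and fields here are illustrative assumptions, not the MTSS format.
@dataclass
class Reference:
    ref_id: str
    description: str  # identity information, stored once


@dataclass
class Shot:
    shot_id: str
    start: float  # seconds
    end: float    # seconds
    description: str


@dataclass
class Event:
    shot_id: str        # assumed link into the Shot stream
    ref_ids: List[str]  # assumed links into the Reference stream
    kind: str           # "visual" or "audio"
    description: str


@dataclass
class SceneScript:
    global_summary: str
    references: Dict[str, Reference] = field(default_factory=dict)
    shots: Dict[str, Shot] = field(default_factory=dict)
    events: List[Event] = field(default_factory=list)

    def dangling_links(self) -> List[Event]:
        """Return events whose shot or reference IDs do not resolve."""
        return [
            e for e in self.events
            if e.shot_id not in self.shots
            or any(r not in self.references for r in e.ref_ids)
        ]

    def render_event(self, event: Event) -> str:
        """Resolve an event's links into one readable line."""
        who = ", ".join(self.references[r].description for r in event.ref_ids)
        shot = self.shots[event.shot_id]
        return f"[{shot.start:.1f}-{shot.end:.1f}s] {who}: {event.description}"


def build_example() -> SceneScript:
    script = SceneScript(global_summary="A short kitchen scene.")
    script.references["p1"] = Reference("p1", "woman in a red coat")
    script.shots["s1"] = Shot("s1", 0.0, 4.0, "wide shot of a kitchen")
    script.events.append(Event("s1", ["p1"], "audio", "says hello"))
    return script


script = build_example()
before = script.render_event(script.events[0])

# Local edit: change the identity description in exactly one place.
script.references["p1"].description = "woman in a blue coat"
after = script.render_event(script.events[0])
```

In this sketch the event record itself is never touched by the edit, yet its rendered form picks up the new description because the identity text lives only in the Reference stream. A monolithic paragraph would need every mention of the character rewritten.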
