Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

Tencent Hunyuan Team

arXiv:2604.11244 · cs.CV · April 16, 2026

TL;DR

This paper introduces Multi-Stream Scene Script (MTSS), a video captioning paradigm that replaces a single narrative paragraph with factorized, explicitly grounded scene descriptions, with reported improvements in understanding, reasoning, and generation quality across several models.

Contribution

MTSS replaces monolithic video descriptions with factorized, grounded streams, enhancing scalability, fidelity, and performance in multimodal video understanding and generation.

Findings

01. Reduces total error rate by 25% on Video-SALMONN-2

02. Achieves 67% performance gain on Daily-Omni benchmark

03. Improves cross-shot identity, audio-visual alignment, and temporal controllability in video generation

Abstract

Advances in Multimodal Large Language Models (MLLMs) are transforming video captioning from a descriptive endpoint into a semantic interface for both video understanding and generation. However, the dominant paradigm still casts videos as monolithic narrative paragraphs that entangle visual, auditory, and identity information. This dense coupling not only compromises representational fidelity but also limits scalability, since even local edits can trigger global rewrites. To address this structural bottleneck, we propose Multi-Stream Scene Script (MTSS), a novel paradigm that replaces monolithic text with factorized and explicitly grounded scene descriptions. MTSS is built on two core principles: Stream Factorization, which decouples a video into complementary streams (Reference, Shot, Event, and Global), and Relational Grounding, which reconnects these isolated streams through explicit…
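
To make the factorization idea concrete, here is a minimal Python sketch of what a multi-stream script could look like as a data structure. It is not the paper's schema: every class name, field name, and the sample content are hypothetical, and the choice to link streams through shared identifiers is an assumption, since the abstract is cut off before it describes how Relational Grounding works. The sketch only illustrates the stated motivation that a local edit should not force a global rewrite.

```python
from dataclasses import dataclass, field


@dataclass
class Reference:
    """Hypothetical Reference-stream entry: one recurring entity."""
    ref_id: str
    description: str


@dataclass
class Shot:
    """Hypothetical Shot-stream entry: one visual segment."""
    shot_id: str
    start_s: float
    end_s: float
    visual: str


@dataclass
class Event:
    """Hypothetical Event-stream entry, linked to other streams by id."""
    event_id: str
    shot_id: str        # assumed link to a Shot entry
    ref_ids: list[str]  # assumed links to Reference entries
    audio: str


@dataclass
class SceneScript:
    """Container for the four streams named in the abstract."""
    global_summary: str  # Global stream
    references: dict[str, Reference] = field(default_factory=dict)
    shots: dict[str, Shot] = field(default_factory=dict)
    events: list[Event] = field(default_factory=list)

    def dangling_links(self) -> list[str]:
        """Report events whose ids point at missing shots or references."""
        problems = []
        for ev in self.events:
            if ev.shot_id not in self.shots:
                problems.append(f"{ev.event_id}: unknown shot {ev.shot_id}")
            for rid in ev.ref_ids:
                if rid not in self.references:
                    problems.append(f"{ev.event_id}: unknown reference {rid}")
        return problems


# Illustrative content only; none of it comes from the paper.
script = SceneScript(global_summary="A chef explains a recipe in a home kitchen.")
script.references["R1"] = Reference("R1", "woman in a red apron")
script.shots["S1"] = Shot("S1", 0.0, 4.5, "medium shot at the counter")
script.events.append(Event("E1", "S1", ["R1"], "R1 says: 'Start with the onions.'"))

# A local edit touches one Reference entry; shots and events stay unchanged.
script.references["R1"].description = "woman in a blue apron"
```

In a monolithic paragraph the same change would mean finding and rewriting every sentence that mentions the apron. Here the events refer to `R1` by id, so they remain valid after the edit, and `dangling_links()` can flag any link that an edit does break.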
