OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

Junfu Pu; Yuxin Chen; Teng Wang; Ying Shan

arXiv:2604.11102 · cs.CV · April 14, 2026

TL;DR

This paper introduces OmniScript, an 8B-parameter audio-visual language model that generates detailed, temporally grounded, scene-by-scene scripts from long-form cinematic videos.

Contribution

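The paper defines the video-to-script (V2S) task and contributes a human-annotated benchmark, a temporally-aware hierarchical evaluation framework, and a progressive training pipeline (chain-of-thought supervised fine-tuning followed by reinforcement learning) for long-form video script generation.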

Findings

01. OmniScript outperforms larger open-source models in temporal localization.

02. OmniScript achieves comparable performance to state-of-the-art proprietary models.

03. The proposed benchmark and evaluation framework facilitate progress in long-form video understanding.

Abstract

Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form cinematic videos into detailed, temporally grounded scripts remains a significant challenge. This paper introduces the novel video-to-script (V2S) task, aiming to generate hierarchical, scene-by-scene scripts encompassing character actions, dialogues, expressions, and audio cues. To facilitate this, we construct a first-of-its-kind human-annotated benchmark and propose a temporally-aware hierarchical evaluation framework. Furthermore, we present OmniScript, an 8B-parameter omni-modal (audio-visual) language model tailored for long-form narrative comprehension. OmniScript is trained via a progressive pipeline that leverages chain-of-thought supervised fine-tuning for plot and character reasoning, followed by reinforcement learning using…
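
To make the idea of a hierarchical, temporally grounded script more concrete, here is a minimal Python sketch. It is not the paper's schema or metric: the `Scene` and `Event` classes, their field names, and the sample content are all hypothetical, and `temporal_iou` is the generic interval intersection-over-union commonly used to score temporal localization, swapped in here purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Event:
    """One timed entry inside a scene (hypothetical field names)."""
    start: float               # seconds from the start of the video
    end: float
    kind: str                  # e.g. "action", "dialogue", "expression", "audio_cue"
    character: Optional[str]   # None for events with no speaker or actor
    text: str


@dataclass
class Scene:
    """A scene-level node that groups its timed events (hypothetical)."""
    start: float
    end: float
    summary: str
    events: List[Event] = field(default_factory=list)


def temporal_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


# Invented sample data, used only to exercise the structures above.
scene = Scene(
    start=0.0,
    end=40.0,
    summary="Two characters argue in a kitchen.",
    events=[
        Event(2.0, 6.0, "dialogue", "Anna", "You said you'd be home by eight."),
        Event(6.0, 9.0, "audio_cue", None, "A door slams off-screen."),
    ],
)

# Compare a reference scene span against a hypothetical predicted span.
predicted_span = (10.0, 40.0)
iou = temporal_iou((scene.start, scene.end), predicted_span)
```

Nesting events under scenes mirrors the scene-by-scene hierarchy the abstract describes, and keeping explicit start and end times on both levels is what would let a temporally-aware evaluation match predicted units against annotated ones.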
