OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video
Junfu Pu, Yuxin Chen, Teng Wang, Ying Shan

TL;DR
This paper introduces OmniScript, an 8B-parameter audio-visual language model that generates detailed, temporally grounded, scene-by-scene scripts from long-form cinematic videos.
Contribution
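The paper defines a new video-to-script (V2S) task, builds a human-annotated benchmark for it, and proposes a temporally aware hierarchical evaluation framework. It also describes a progressive training pipeline for long-form script generation: chain-of-thought supervised fine-tuning followed by reinforcement learning.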
Findings
OmniScript outperforms larger open-source models in temporal localization.
OmniScript performs comparably to state-of-the-art proprietary models.
The proposed benchmark and evaluation framework facilitate progress in long-form video understanding.
Abstract
Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form cinematic videos into detailed, temporally grounded scripts remains a significant challenge. This paper introduces the novel video-to-script (V2S) task, aiming to generate hierarchical, scene-by-scene scripts encompassing character actions, dialogues, expressions, and audio cues. To facilitate this, we construct a first-of-its-kind human-annotated benchmark and propose a temporally-aware hierarchical evaluation framework. Furthermore, we present OmniScript, an 8B-parameter omni-modal (audio-visual) language model tailored for long-form narrative comprehension. OmniScript is trained via a progressive pipeline that leverages chain-of-thought supervised fine-tuning for plot and character reasoning, followed by reinforcement learning using…
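The abstract describes hierarchical, scene-by-scene scripts that carry character actions, dialogues, expressions, and audio cues, each grounded in time. As a rough illustration only, such a script might be represented as shown below. The names Scene, Event, and temporal_iou are assumptions made for this sketch and do not come from the paper. Temporal IoU is a common measure of temporal localization, used here for illustration; it is not claimed to be the paper's evaluation framework.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """One timestamped event inside a scene (hypothetical fields)."""
    start: float          # seconds from the start of the video
    end: float            # seconds from the start of the video
    character: str
    action: str = ""
    dialogue: str = ""
    expression: str = ""
    audio_cue: str = ""


@dataclass
class Scene:
    """A scene-level entry that groups events (hypothetical fields)."""
    start: float
    end: float
    summary: str
    events: List[Event] = field(default_factory=list)


def temporal_iou(a_start: float, a_end: float,
                 b_start: float, b_end: float) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    inter = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - inter
    return inter / union if union > 0 else 0.0


# Usage: compare a predicted scene span against a reference span.
reference = Scene(0.0, 10.0, "Two characters argue in a kitchen.")
predicted = Scene(5.0, 15.0, "An argument in a kitchen.")
overlap = temporal_iou(predicted.start, predicted.end,
                       reference.start, reference.end)
```

Nesting events under scenes mirrors the hierarchical structure described in the abstract, and keeping start and end times on both levels allows localization to be scored separately at each level.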
