TeaserGen: Generating Teasers for Long Documentaries
Weihan Xu, Paul Pu Liang, Haven Kim, Julian McAuley, Taylor Berg-Kirkpatrick, Hao-Wen Dong

TL;DR
This paper introduces DocumentaryNet, a new dataset of documentaries and teasers, and proposes TeaserGen, a two-stage system that generates teaser narrations and selects relevant visuals for long documentaries.
Contribution
The work provides the first large-scale documentary-teaser dataset and develops a novel two-stage teaser generation system utilizing multimodal data and pretrained language-vision models.
Findings
Pretraining-based models outperform deep autoregressive models in visual relevance.
DocumentaryNet enables research in long-form multimodal content summarization.
TeaserGen effectively combines narration generation and visual content selection.
Abstract
Teasers are an effective tool for promoting content in entertainment, commercial and educational fields. However, creating an effective teaser for a long video is challenging because it requires long-range multimodal modeling of the input video while maintaining audiovisual alignment, managing scene changes and preserving factual accuracy in the output teaser. Progress in this research direction has been hindered by the lack of a publicly available dataset. In this work, we present DocumentaryNet, a collection of 1,269 documentaries paired with their teasers, featuring multimodal data streams of video, speech, music, sound effects and narrations. With DocumentaryNet, we propose a new two-stage system for generating teasers from long documentaries. The proposed TeaserGen system first generates the teaser narration from the transcribed narration of the documentary…
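The visual-selection half of the two-stage idea can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each teaser narration sentence and each candidate video frame has already been embedded into a shared vector space by some pretrained language-vision model, and it simply picks, for every sentence, the frame with the highest cosine similarity. The function names `cosine` and `select_frames` and the toy embeddings are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_frames(
    sentence_embs: Sequence[Sequence[float]],
    frame_embs: Sequence[Sequence[float]],
) -> List[int]:
    """For each narration sentence, return the index of the most similar frame.

    Assumption: embeddings are precomputed and live in the same space.
    This greedy per-sentence argmax ignores scene changes and repeated frames.
    """
    if not frame_embs:
        raise ValueError("at least one frame embedding is required")
    chosen = []
    for sent in sentence_embs:
        scores = [cosine(sent, frame) for frame in frame_embs]
        chosen.append(max(range(len(scores)), key=scores.__getitem__))
    return chosen


# Toy 2-D embeddings standing in for real model outputs (hypothetical values).
sentences = [[1.0, 0.0], [0.0, 1.0]]
frames = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
picked = select_frames(sentences, frames)
print(picked)
```

A practical system would also need to handle duplicate selections and abrupt scene changes, which this sketch deliberately leaves out.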
Peer Reviews
Decision·ICLR 2025 Poster
- The newly introduced dataset could be valuable for the research community.
- The task is interesting and meaningful.
- The experiments are thorough, and the performance appears good.
- The system heavily relies on LLMs and vision-language models, which may lead to error accumulation. How can we evaluate whether the teaser narration generated by GPT is effective?
- Some video summarization and highlight detection methods could also be applied to generate teasers, but the paper lacks a comparison with these approaches.
1. The paper proposes frameworks to generate teasers from documentaries using audiovisual alignment and scene changes.
2. The paper demonstrates robust experiments and comparisons to baseline models.
3. The authors evaluate thoroughly on their dataset, using both objective metrics (such as F1 score and scene change rate) and subjective evaluations (coherence, engagingness) to validate their results.
4. The paper includes extensive ablation studies.
1. Limited dataset scale, both for training and testing. Although the dataset is domain specific, its scale is still limited, with just 1.2k documentaries.
2. Reliance on a pretrained LLM for teaser narration generation, without any check for hallucinations or for errors compounding from this step.
3. The work is very domain specific. The framework's reliance on pretrained language-vision models for narration-video alignment, while effective for documentaries, may struggle with complex visual elements t
- Tackles a unique problem in automated teaser generation for documentaries with TeaserGen, a creative, narration-centered two-stage approach that combines large language models with language-vision models for cohesive narration and visual alignment, showing effective and innovative use of existing technologies.
- Provides solid empirical support with comparisons to baseline models across objective (e.g., F1 score, CLIPScore) and subjective metrics, as well as the introduction of DocumentaryNet
- The paper presents an innovative approach to video teaser generation using pretrained language-vision models, but several issues need to be addressed to enhance its clarity and robustness. In rows 210 and 211, there is a notation inconsistency where $S$ is defined as a sequence of language tokens, yet later each $S_i$ is referred to as a waveform (audio signal). This inconsistency creates confusion, and it's crucial for the notation to consistently represent either language tokens or audio wav
Taxonomy
Topics: Digital Humanities and Scholarship
