HiVid-Narrator: Hierarchical Video Narrative Generation with Scene-Primed ASR-anchored Compression
Haoxuan Li, Mengyan Li, Junjun Zheng

TL;DR
This paper presents HiVid-Narrator, a hierarchical video narration framework for e-commerce videos that uses scene-primed compression and dual-granularity annotations to generate coherent, fact-grounded stories efficiently.
Contribution
It introduces the E-HVC dataset with dual-granularity annotations and proposes SPA-Compressor for token reduction, enabling improved narrative generation in dense, fast-paced videos.
Findings
HiVid-Narrator outperforms existing methods in narrative quality.
SPA-Compressor reduces input tokens while maintaining performance.
E-HVC dataset provides detailed annotations for e-commerce videos.
Abstract
Generating structured narrations for real-world e-commerce videos requires models to perceive fine-grained visual details and organize them into coherent, high-level stories--capabilities that existing approaches struggle to unify. We introduce the E-commerce Hierarchical Video Captioning (E-HVC) dataset with dual-granularity, temporally grounded annotations: a Temporal Chain-of-Thought that anchors event-level observations and a Chapter Summary that composes them into concise, story-centric summaries. Rather than prompting for chapters directly, we adopt a staged construction that first gathers reliable linguistic and visual evidence via curated ASR and frame-level descriptions, then refines coarse annotations into precise chapter boundaries and titles conditioned on the Temporal Chain-of-Thought, yielding fact-grounded, time-aligned narratives. We also observe that e-commerce videos are…
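As a rough illustration of the dual-granularity annotations described above, the sketch below models event-level observations and chapter-level summaries as two record types and nests events under the chapter that contains them. It is only a minimal sketch under stated assumptions: the names Event, Chapter, and attach_events, the field layout, and the use of seconds for timestamps are all hypothetical and are not taken from the E-HVC dataset or its release format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """One event-level observation (fine granularity); times in seconds (assumed)."""
    start: float
    end: float
    observation: str


@dataclass
class Chapter:
    """One chapter-level summary (coarse granularity); times in seconds (assumed)."""
    start: float
    end: float
    title: str
    summary: str
    events: List[Event] = field(default_factory=list)


def attach_events(chapters: List[Chapter], events: List[Event]) -> List[Event]:
    """Attach each event to the chapter whose [start, end) span contains the
    event's start time. Returns the events that fall inside no chapter."""
    unassigned = []
    for event in sorted(events, key=lambda e: e.start):
        for chapter in chapters:
            if chapter.start <= event.start < chapter.end:
                chapter.events.append(event)
                break
        else:
            # No chapter span covered this event's start time.
            unassigned.append(event)
    return unassigned
```

Keeping the two granularities as separate records linked by time makes it easy to check that every chapter summary is backed by at least one time-aligned observation, which is the kind of grounding the abstract emphasizes.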
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization
