HiVid-Narrator: Hierarchical Video Narrative Generation with Scene-Primed ASR-anchored Compression

Haoxuan Li; Mengyan Li; Junjun Zheng

arXiv:2601.07366·cs.CV·January 13, 2026

Open Access

TL;DR

This paper presents HiVid-Narrator, a hierarchical video narration framework for e-commerce videos that uses scene-primed compression and dual-granularity annotations to generate coherent, fact-grounded stories efficiently.

Contribution

The paper introduces the E-HVC dataset with dual-granularity annotations and proposes SPA-Compressor for token reduction, enabling improved narrative generation in dense, fast-paced videos.

Findings

1. HiVid-Narrator outperforms existing methods in narrative quality.
2. SPA-Compressor reduces input tokens while maintaining performance.
3. E-HVC dataset provides detailed annotations for e-commerce videos.

Abstract

Generating structured narrations for real-world e-commerce videos requires models to perceive fine-grained visual details and organize them into coherent, high-level stories, capabilities that existing approaches struggle to unify. We introduce the E-commerce Hierarchical Video Captioning (E-HVC) dataset with dual-granularity, temporally grounded annotations: a Temporal Chain-of-Thought that anchors event-level observations and Chapter Summaries that compose them into concise, story-centric summaries. Rather than directly prompting chapters, we adopt a staged construction that first gathers reliable linguistic and visual evidence via curated ASR and frame-level descriptions, then refines coarse annotations into precise chapter boundaries and titles conditioned on the Temporal Chain-of-Thought, yielding fact-grounded, time-aligned narratives. We also observe that e-commerce videos are…
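One way to picture the dual-granularity annotations described in the abstract is as two linked record types: fine-grained, time-anchored events and coarser chapters that group them. The sketch below is illustrative only. The class names, field names, and the midpoint rule for attaching events to chapters are all assumptions made for this example, not the E-HVC schema or the authors' construction procedure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """One fine-grained, time-anchored observation (hypothetical schema)."""
    start: float  # seconds
    end: float    # seconds
    observation: str


@dataclass
class Chapter:
    """One coarse, story-level unit (hypothetical schema)."""
    start: float  # seconds
    end: float    # seconds
    title: str
    summary: str
    events: List[Event] = field(default_factory=list)


def attach_events(chapters: List[Chapter], events: List[Event]) -> List[Chapter]:
    """Attach each event to the chapter whose span contains the event midpoint.

    The midpoint rule is an assumption for illustration; events whose
    midpoint falls outside every chapter are left unattached.
    """
    for event in sorted(events, key=lambda e: e.start):
        midpoint = (event.start + event.end) / 2
        for chapter in chapters:
            if chapter.start <= midpoint < chapter.end:
                chapter.events.append(event)
                break
    return chapters
```

With this layout, each chapter carries both its story-level summary and the event-level evidence that falls inside its time span, which is the kind of time alignment the abstract refers to.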

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization