NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control
Yufan Wen, Zhaocheng Liu, YeGuo Hua, Ziyi Guo, Lihua Zhang, Chun Yuan, Jian Wu

TL;DR
NarraScore is a hierarchical framework that uses emotion as a high-level narrative guide to generate coherent soundtracks for long videos, leveraging vision-language models as affective sensors to improve semantic coherence.
Contribution
NarraScore introduces a hierarchical approach built on a dual-branch injection strategy and repurposes frozen vision-language models for narrative-aware soundtrack synthesis, targeting the scalability and coherence challenges of long-form video.
Findings
Achieves state-of-the-art narrative alignment and consistency.
Operates with negligible computational overhead.
Effectively mitigates overfitting with a minimalist design.
Abstract
Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these gaps, we propose NarraScore, a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. Uniquely, we repurpose frozen Vision-Language Models (VLMs) as continuous affective sensors, distilling high-dimensional visual streams into dense, narrative-aware Valence-Arousal trajectories. Mechanistically, NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism: a Global Semantic Anchor ensures stylistic stability, while a surgical Token-Level Affective Adapter modulates local tension via…
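To make the Dual-Branch Injection idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract is cut off before it says how the adapter modulates the generator, so the additive (residual) conditioning, the linear interpolation of the Valence-Arousal trajectory to the token rate, and every name below (`resample_trajectory`, `dual_branch_inject`, `w_adapter`) are assumptions made purely for illustration.

```python
import numpy as np


def resample_trajectory(va, num_tokens):
    """Linearly interpolate a (T, 2) valence-arousal trajectory to num_tokens steps.

    Assumption: the affective sensor emits sparse (valence, arousal) pairs and
    the music generator consumes one conditioning vector per token.
    """
    va = np.asarray(va, dtype=float)
    src = np.linspace(0.0, 1.0, len(va))      # normalized time of the sparse samples
    dst = np.linspace(0.0, 1.0, num_tokens)   # normalized time of each music token
    return np.stack(
        [np.interp(dst, src, va[:, k]) for k in range(va.shape[1])], axis=1
    )


def dual_branch_inject(hidden, global_anchor, va_tokens, w_adapter):
    """Combine a global and a token-level conditioning signal (hypothetical form).

    hidden:        (N, D) token hidden states of the music generator
    global_anchor: (D,)   one style vector shared by every token
    va_tokens:     (N, 2) per-token valence-arousal values
    w_adapter:     (2, D) small projection standing in for the adapter
    """
    global_branch = global_anchor[None, :]    # identical offset for all tokens
    local_branch = va_tokens @ w_adapter      # offset that varies token by token
    return hidden + global_branch + local_branch


# Usage: a trajectory that rises from calm to tense across 5 music tokens.
va_tokens = resample_trajectory([[0.0, 0.0], [1.0, 1.0]], num_tokens=5)
hidden = np.zeros((5, 4))
anchor = np.ones(4)
w_adapter = np.full((2, 4), 0.5)
conditioned = dual_branch_inject(hidden, anchor, va_tokens, w_adapter)
```

In this toy setup the global branch adds the same offset to every token, while the local branch grows with the trajectory, which is one simple way to picture "stylistic stability" alongside "local tension".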
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Music Technology and Sound Studies
