NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control

Yufan Wen; Zhaocheng Liu; YeGuo Hua; Ziyi Guo; Lihua Zhang; Chun Yuan; Jian Wu

arXiv:2602.09070·cs.SD·February 13, 2026

Open Access

TL;DR

NarraScore is a hierarchical framework that uses emotion as a high-level narrative guide to generate coherent soundtracks for long videos, leveraging vision-language models as affective sensors to improve semantic coherence.

Contribution

NarraScore introduces a hierarchical approach with a Dual-Branch Injection strategy that repurposes frozen vision-language models for narrative-aware soundtrack synthesis, addressing scalability and coherence challenges.

Findings

01. Achieves state-of-the-art narrative alignment and consistency.

02. Operates with negligible computational overhead.

03. Effectively mitigates overfitting with a minimalist design.

Abstract

Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these gaps, we propose NarraScore, a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. Uniquely, we repurpose frozen Vision-Language Models (VLMs) as continuous affective sensors, distilling high-dimensional visual streams into dense, narrative-aware Valence-Arousal trajectories. Mechanistically, NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism: a Global Semantic Anchor ensures stylistic stability, while a surgical Token-Level Affective Adapter modulates local tension via…
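To make the idea of a dense Valence-Arousal trajectory concrete, the following is a minimal sketch that linearly interpolates sparse per-clip emotion estimates onto a uniform time grid. It is not the paper's method: the function name `densify_va`, the (valence, arousal) tuple layout, the assumed value range, and the choice of linear interpolation are all illustrative assumptions, and the listing above does not say how NarraScore actually produces or consumes its trajectories.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical representation: (valence, arousal), each assumed to lie in [-1, 1].
VA = Tuple[float, float]


def densify_va(times: List[float], values: List[VA], step: float) -> List[VA]:
    """Linearly interpolate sparse VA samples onto a uniform grid.

    `times` must be strictly increasing (seconds); `values[i]` is the VA
    estimate at `times[i]`; `step` is the grid spacing in seconds.
    """
    if not times or len(times) != len(values):
        raise ValueError("times and values must be non-empty and of equal length")
    if step <= 0:
        raise ValueError("step must be positive")

    dense: List[VA] = []
    # Number of whole steps that fit between the first and last sample.
    n_steps = int((times[-1] - times[0]) / step + 1e-9)
    for k in range(n_steps + 1):
        t = times[0] + k * step
        i = bisect_right(times, t) - 1  # index of the last sample at or before t
        if i >= len(times) - 1:
            dense.append(values[-1])  # at the final sample: nothing to blend
            continue
        w = (t - times[i]) / (times[i + 1] - times[i])  # blend weight in [0, 1)
        (v0, a0), (v1, a1) = values[i], values[i + 1]
        dense.append((v0 + w * (v1 - v0), a0 + w * (a1 - a0)))
    return dense


if __name__ == "__main__":
    # Three sparse estimates, two seconds apart, resampled at one-second steps.
    sparse_t = [0.0, 2.0, 4.0]
    sparse_va = [(0.0, 0.0), (1.0, 0.5), (0.0, 1.0)]
    for point in densify_va(sparse_t, sparse_va, step=1.0):
        print(point)
```

In a real pipeline the grid spacing would presumably be matched to the music model's token rate, but that rate is not stated here, so `step` is left as a free parameter.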

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Music Technology and Sound Studies