Sounding Highlights: Dual-Pathway Audio Encoders for Audio-Visual Video Highlight Detection

Seohyun Joo; Yoori Oh

arXiv:2602.03891 · eess.AS · February 6, 2026

Open Access

TL;DR

This paper introduces DAViHD, a framework built around a dual-pathway audio encoder that improves audio-visual video highlight detection by capturing both the semantic content and the spectro-temporal dynamics of sound, achieving state-of-the-art results on the MrHiSum benchmark.

Contribution

The paper proposes a novel dual-pathway audio encoder with semantic and dynamic pathways, improving the utilization of audio cues in highlight detection models.

Findings

01. Achieves state-of-the-art performance on the MrHiSum benchmark.

02. Demonstrates the importance of dynamic audio features for highlight detection.

03. Validates the effectiveness of the dual-pathway encoding approach.

Abstract

Audio-visual video highlight detection aims to automatically identify the most salient moments in videos by leveraging both visual and auditory cues. However, existing models often underutilize the audio modality, focusing on high-level semantic features while failing to fully leverage the rich, dynamic characteristics of sound. To address this limitation, we propose a novel framework, Dual-Pathway Audio Encoders for Video Highlight Detection (DAViHD). The dual-pathway audio encoder is composed of a semantic pathway for content understanding and a dynamic pathway that captures spectro-temporal dynamics. The semantic pathway extracts high-level information by identifying the content within the audio, such as speech, music, or specific sound events. The dynamic pathway employs a frequency-adaptive mechanism as time evolves to jointly model these dynamics, enabling it to identify transient…
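
To make the two-pathway idea concrete, here is a minimal sketch in plain Python. It is not the authors' architecture: the abstract does not specify the layers, so everything below is an assumption made for illustration. The semantic pathway is stood in for by precomputed per-frame embeddings (for example, from some pretrained audio tagger), and the dynamic pathway is approximated by per-bin spectral change with softmax weights over frequency that are recomputed at every time step. The function names `dynamic_pathway` and `fuse` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_pathway(spectrogram):
    """Return one dynamics score per frame.

    spectrogram: list of T frames, each a list of F log-energy values.
    ASSUMPTION: "frequency-adaptive" is illustrated here as softmax
    weights over frequency bins, recomputed at each time step from the
    amount of change in each bin. The paper's mechanism may differ.
    """
    scores = []
    previous = spectrogram[0]
    for frame in spectrogram:
        # Absolute change of each frequency bin relative to the previous frame.
        delta = [abs(cur - prev) for cur, prev in zip(frame, previous)]
        # Bins that change the most receive the largest weight at this step.
        weights = softmax(delta)
        scores.append(sum(w * d for w, d in zip(weights, delta)))
        previous = frame
    return scores


def fuse(semantic_embeddings, dynamic_scores):
    """Concatenate each frame's semantic embedding with its dynamics score.

    ASSUMPTION: simple concatenation stands in for whatever fusion the
    paper actually uses.
    """
    return [list(emb) + [score]
            for emb, score in zip(semantic_embeddings, dynamic_scores)]


# Toy input: a sudden burst of energy in the middle bin at frame 2.
spec = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 5.0, 0.0]]
semantic = [[1.0, 2.0] for _ in spec]  # placeholder semantic embeddings
dynamic = dynamic_pathway(spec)
fused = fuse(semantic, dynamic)
```

In this toy input, only the frame where the burst begins gets a non-zero dynamics score, which is the kind of transient cue a purely semantic embedding could miss.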

Taxonomy

Topics: Video Analysis and Summarization · Music and Audio Processing · Speech and Audio Processing