Sounding Highlights: Dual-Pathway Audio Encoders for Audio-Visual Video Highlight Detection
Seohyun Joo, Yoori Oh

TL;DR
This paper introduces DAViHD, a dual-pathway audio encoder that enhances audio-visual video highlight detection by capturing both semantic content and dynamic spectro-temporal features, leading to state-of-the-art results.
Contribution
The paper proposes a novel dual-pathway audio encoder with semantic and dynamic pathways, improving the utilization of audio cues in highlight detection models.
Findings
Achieves state-of-the-art performance on the MrHiSum benchmark.
Demonstrates the importance of dynamic audio features for highlight detection.
Validates the effectiveness of the dual-pathway encoding approach.
Abstract
Audio-visual video highlight detection aims to automatically identify the most salient moments in videos by leveraging both visual and auditory cues. However, existing models often underutilize the audio modality, focusing on high-level semantic features while failing to fully leverage the rich, dynamic characteristics of sound. To address this limitation, we propose a novel framework, Dual-Pathway Audio Encoders for Video Highlight Detection (DAViHD). The dual-pathway audio encoder is composed of a semantic pathway for content understanding and a dynamic pathway that captures spectro-temporal dynamics. The semantic pathway extracts high-level information by identifying the content within the audio, such as speech, music, or specific sound events. The dynamic pathway employs a frequency-adaptive mechanism as time evolves to jointly model these dynamics, enabling it to identify transient…
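The dual-pathway idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the mean-energy stand-in for the semantic pathway, the softmax band weighting used as a "frequency-adaptive" mechanism, and the fusion by concatenation are all assumptions made for illustration only. The input is assumed to be a spectrogram given as a list of frames, each holding per-band energies.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def semantic_pathway(spec):
    """Hypothetical stand-in for a content encoder.

    A real system would likely use a pretrained audio model here;
    this placeholder just returns the mean band energy per frame.
    """
    return [[sum(frame) / len(frame)] for frame in spec]


def dynamic_pathway(spec):
    """Hypothetical spectro-temporal change feature.

    For each frame, take the absolute change per band relative to the
    previous frame, then weight bands by a softmax over those changes
    so that bands changing the most dominate (an assumed form of
    frequency-adaptive weighting, not the paper's mechanism).
    """
    feats = []
    prev = spec[0]
    for frame in spec:
        delta = [abs(cur - old) for cur, old in zip(frame, prev)]
        weights = softmax(delta)
        feats.append([sum(w * d for w, d in zip(weights, delta))])
        prev = frame
    return feats


def dual_pathway_encode(spec):
    """Concatenate semantic and dynamic features frame by frame (assumed fusion)."""
    sem = semantic_pathway(spec)
    dyn = dynamic_pathway(spec)
    return [s + d for s, d in zip(sem, dyn)]


# Toy spectrogram: 4 frames x 3 bands, with a sudden onset in band 1 at frame 2.
toy_spec = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 3.0, 0.0]]
features = dual_pathway_encode(toy_spec)
```

In this toy input, the dynamic feature is largest at the onset frame and drops back to zero once the sound is sustained, while the semantic stand-in stays constant across the sustained frames. That contrast is the kind of transient cue the abstract attributes to the dynamic pathway.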
Taxonomy
Topics: Video Analysis and Summarization · Music and Audio Processing · Speech and Audio Processing
