Play It Back: Iterative Attention for Audio Recognition
Alexandros Stergiou, Dima Damen

TL;DR
This paper introduces an attention-based model that iteratively replays the most discriminative audio segments at progressively finer temporal resolution, achieving state-of-the-art results on three audio-classification benchmarks.
Contribution
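The proposed end-to-end model starts from the full audio sequence, uses slot attention to select which temporal segments to replay, and re-extracts features from those segments with a smaller hop length at each playback, targeting fine-grained audio recognition.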
Findings
Achieves state-of-the-art performance on AudioSet, VGG-Sound, and EPIC-KITCHENS-100.
Effectively refines focus on discriminative sounds through iterative replay.
Demonstrates the benefit of selective segment replay for audio classification.
Abstract
A key function of auditory cognition is the association of characteristic sounds with their corresponding semantics over time. Humans attempting to discriminate between fine-grained audio categories often replay the same discriminative sounds to increase their prediction confidence. We propose an end-to-end attention-based architecture that, through selective repetition, attends over the most discriminative sounds across the audio sequence. Our model initially uses the full audio sequence and iteratively refines the temporal segments replayed based on slot attention. At each playback, the selected segments are replayed using a smaller hop length, which represents higher-resolution features within these segments. We show that our method can consistently achieve state-of-the-art performance across three audio-classification benchmarks: AudioSet, VGG-Sound, and EPIC-KITCHENS-100.
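The replay idea in the abstract (keep fewer segments, then re-extract features with a smaller hop length) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the segment scoring below uses mean energy as a stand-in for the learned slot attention, and the function names, window size, segment counts, and the choice to halve the hop length at each playback are all assumptions made for illustration.

```python
import numpy as np

def spectrogram(signal, win_length=400, hop_length=160):
    """Magnitude STFT; a smaller hop_length yields more frames per second."""
    n_frames = 1 + (len(signal) - win_length) // hop_length
    # Index matrix: row i holds the sample positions of frame i.
    idx = hop_length * np.arange(n_frames)[:, None] + np.arange(win_length)[None, :]
    frames = signal[idx] * np.hanning(win_length)
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, win_length // 2 + 1)

def select_segments(signal, n_segments=4, keep=2):
    """Keep the `keep` highest-scoring segments, in temporal order.

    ASSUMPTION: mean energy is a placeholder score; the paper selects
    segments with slot attention, which is learned and not shown here.
    """
    segments = np.array_split(signal, n_segments)
    scores = np.array([np.mean(s ** 2) for s in segments])
    top = np.sort(np.argsort(scores)[-keep:])
    return np.concatenate([segments[i] for i in top])

def iterative_replay(signal, n_playbacks=3, hop_length=160):
    """Return one feature map per playback, each at a finer hop length."""
    features = []
    for _ in range(n_playbacks):
        features.append(spectrogram(signal, hop_length=hop_length))
        signal = select_segments(signal)        # replay only the selected audio
        hop_length = max(1, hop_length // 2)    # ASSUMPTION: halve the hop each time
    return features

rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)  # 1 s of synthetic audio at an assumed 16 kHz
shapes = [f.shape for f in iterative_replay(audio)]
print(shapes)
```

With these settings the replayed audio shrinks from 16000 to 8000 to 4000 samples, while the number of frames stays close to constant (98, 96, 91), so each later playback describes a shorter stretch of audio at a finer temporal resolution.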
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
