TL;DR
HEAR introduces a human-inspired decoupled architecture for audio representation learning, significantly reducing computational costs while maintaining competitive performance across benchmarks.
Contribution
A decoupled architecture inspired by human cognition: an Acoustic Model handles local feature extraction, a Task Model handles global context modeling, and an Acoustic Tokenizer trained via knowledge distillation supports efficient audio learning.
Findings
HEAR requires only 15M parameters and 9.47 GFLOPs for inference, a fraction of the computational cost of standard Transformer-based models.
HEAR achieves competitive accuracy on various audio classification benchmarks.
The approach enables efficient Masked Audio Modeling with a lightweight architecture.
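To see why separating local and global processing can cut cost, it helps to do rough arithmetic on self-attention, whose cost grows quadratically with sequence length. The sketch below is illustrative only and is not HEAR's actual design: the windowed local stage, the pooling factor, and the function names `full_attention_cost` and `decoupled_cost` are all assumptions introduced here.

```python
def full_attention_cost(num_tokens: int, dim: int) -> int:
    """Approximate multiply-adds for the QK^T and attention-times-V products."""
    return 2 * num_tokens * num_tokens * dim


def decoupled_cost(num_tokens: int, dim: int, window: int, pool: int) -> int:
    """Hypothetical split (an assumption, not the paper's method):
    windowed local attention, then global attention over pooled tokens."""
    local = 2 * num_tokens * window * dim          # each token attends to `window` neighbours
    pooled_tokens = num_tokens // pool             # global stage sees a shorter sequence
    global_stage = 2 * pooled_tokens * pooled_tokens * dim
    return local + global_stage


if __name__ == "__main__":
    n, d = 1024, 256
    full = full_attention_cost(n, d)
    split = decoupled_cost(n, d, window=16, pool=8)
    print(f"full: {full:,}  split: {split:,}  ratio: {full / split:.1f}x")
```

The numbers depend entirely on the assumed window and pooling sizes; the point is only that the quadratic term shrinks once the global stage operates on fewer tokens.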
Abstract
While self-supervised learning (SSL) has revolutionized audio representation, the excessive parameterization and quadratic computational cost of standard Transformers limit their deployment on resource-constrained devices. To address this bottleneck, we propose HEAR (Human-inspired Efficient Audio Representation), a novel decoupled architecture. Inspired by the human cognitive ability to isolate local acoustic features from global context, HEAR splits the processing pipeline into two dedicated modules: an Acoustic Model for local feature extraction and a Task Model for global semantic integration. Coupled with an Acoustic Tokenizer trained via knowledge distillation, our approach enables robust Masked Audio Modeling (MAM). Extensive experiments demonstrate that HEAR requires only 15M parameters and 9.47 GFLOPs for inference, operating at a fraction of the computational cost of…
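As a rough illustration of the Masked Audio Modeling objective mentioned in the abstract, the sketch below hides a random subset of frames and scores predictions only at the hidden positions. It assumes the tokenizer emits one discrete token id per frame, which the abstract does not confirm; the helper names, the mask ratio, and the stand-in logits are hypothetical and do not reflect the authors' implementation.

```python
import math
import random


def mask_positions(num_frames: int, mask_ratio: float, rng: random.Random) -> list[int]:
    """Pick a random subset of frame indices to hide from the model."""
    num_masked = max(1, int(num_frames * mask_ratio))
    return sorted(rng.sample(range(num_frames), num_masked))


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one frame's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_token_loss(
    logits: list[list[float]], target_tokens: list[int], masked: list[int]
) -> float:
    """Mean cross-entropy over masked frames only; visible frames are ignored."""
    losses = [-math.log(softmax(logits[i])[target_tokens[i]]) for i in masked]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    rng = random.Random(0)
    num_frames, vocab = 10, 4
    # Stand-in for tokenizer output (assumed discrete ids, one per frame).
    targets = [rng.randrange(vocab) for _ in range(num_frames)]
    masked = mask_positions(num_frames, mask_ratio=0.5, rng=rng)
    # Stand-in for model predictions: uniform logits, i.e. an untrained model.
    logits = [[0.0] * vocab for _ in range(num_frames)]
    print(masked_token_loss(logits, targets, masked))
```

With uniform logits the loss equals the log of the vocabulary size, which is a convenient sanity check before plugging in a real model.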
