Ensemble-Guided Distillation for Compact and Robust Acoustic Scene Classification on Edge Devices
Hossein Sharify, Behnam Raoufi, Mahdy Ramezani, Khosrow Hajsadeghi, Saeed Bagheri Shouraki

TL;DR
This paper introduces a compact, robust acoustic scene classification framework based on ensemble-guided knowledge distillation. It is optimized for edge devices and reports state-of-the-art results on the TAU Urban Acoustic Scenes 2022 Mobile benchmark.
Contribution
The paper proposes an ensemble-guided distillation method that pairs a lightweight student network with a diverse teacher ensemble for efficient edge deployment.
Findings
Achieves state-of-the-art accuracy on the TAU Urban Acoustic Scenes 2022 Mobile benchmark.
Demonstrates robustness to device and noise variability.
Enables efficient inference suitable for edge devices.
Abstract
We present a compact, quantization-ready acoustic scene classification (ASC) framework that couples an efficient student network with a learned teacher ensemble and knowledge distillation. The student backbone uses stacked depthwise-separable "expand-depthwise-project" blocks with global response normalization to stabilize training and improve robustness to device and noise variability, while a global pooling head yields class logits for efficient edge inference. To inject richer inductive bias, we assemble a diverse set of teacher models and learn two complementary fusion heads: z1, which predicts per-teacher mixture weights using a student-style backbone, and z2, a lightweight MLP that performs per-class logit fusion. The student is distilled from the ensemble via temperature-scaled soft targets combined with hard labels, enabling it to approximate the ensemble's decision geometry…
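The distillation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the common formulation in which a KL term on temperature-softened distributions (scaled by T^2) is blended with cross-entropy on the hard label. The function names, the `alpha` and `temperature` defaults, and the fixed mixture weights are all hypothetical. In the paper the per-teacher weights are predicted by the learned z1 head, and the per-class fusion head z2 is not modeled here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits at a given temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_teacher_logits(teacher_logits, mixture_weights):
    """Weighted sum of per-teacher logits, one weight per teacher.

    Assumption: the weights are passed in directly; the paper learns them.
    """
    num_classes = len(teacher_logits[0])
    return [
        sum(w * t[c] for w, t in zip(mixture_weights, teacher_logits))
        for c in range(num_classes)
    ]


def distillation_loss(student_logits, ensemble_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(student, label).

    Assumption: alpha and temperature are illustrative, not the paper's values.
    """
    p_teacher = softmax(ensemble_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions.
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0)
    # Cross-entropy of the unsoftened student prediction against the hard label.
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


# Two hypothetical teachers over three classes, fused with equal weights.
ensemble = fuse_teacher_logits([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]], [0.5, 0.5])
loss = distillation_loss([0.5, 1.5, 0.0], ensemble, label=0)
```

With `alpha=1.0` the loss reduces to the soft-target term alone, which is zero when the student logits match the fused ensemble logits exactly.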
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Domain Adaptation and Few-Shot Learning
