AudioRepInceptionNeXt: A lightweight single-stream architecture for efficient audio recognition
Kin Wai Lau, Yasar Abbas Ur Rehman, Lai-Man Po

TL;DR
AudioRepInceptionNeXt is a lightweight neural network architecture for audio recognition. It cuts parameters and computation by over 50% and runs inference 1.28 times faster than state-of-the-art CNNs while maintaining comparable accuracy, which makes it suitable for edge devices.
Contribution
The paper introduces AudioRepInceptionNeXt, a single-stream architecture optimized for audio recognition tasks. Inspired by efficient vision models such as InceptionNeXt and ConvNeXt, it is built from cascaded multi-scale depth-wise convolutions.
Findings
Reduces parameter count and computation by over 50%.
Improves inference speed by 1.28 times over state-of-the-art CNNs.
Maintains comparable accuracy across various audio tasks.
Abstract
Recent research has successfully adapted vision-based convolutional neural network (CNN) architectures for audio recognition tasks using Mel-Spectrograms. However, these CNNs have high computational costs and memory requirements, limiting their deployment on low-end edge devices. Motivated by the success of efficient vision models like InceptionNeXt and ConvNeXt, we propose AudioRepInceptionNeXt, a single-stream architecture. Its basic building block breaks down the parallel multi-branch depth-wise convolutions with descending scales of k x k kernels into a cascade of two multi-branch depth-wise convolutions. The first multi-branch consists of parallel multi-scale 1 x k depth-wise convolutional layers followed by a similar multi-branch employing parallel multi-scale k x 1 depth-wise convolutional layers. This reduces computational and memory footprint while separating time and frequency…
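To see why replacing k x k depth-wise kernels with a 1 x k stage followed by a k x 1 stage shrinks the footprint, the sketch below counts depth-wise weights for both layouts. It is only an illustration under stated assumptions: the channel count (64), the kernel sizes (11, 7, 3), and the choice to apply every branch to all channels are hypothetical and are not taken from the paper. The resulting ratio covers the depth-wise weights alone, so it should not be read as the paper's reported overall savings.

```python
def depthwise_weights(channels, kernel_h, kernel_w):
    """Weights in one depth-wise conv: one kernel_h x kernel_w filter per channel, no bias."""
    return channels * kernel_h * kernel_w


def parallel_square_branches(channels, kernel_sizes):
    """Baseline layout: one k x k depth-wise branch per kernel size, all in parallel."""
    return sum(depthwise_weights(channels, k, k) for k in kernel_sizes)


def cascaded_separable_branches(channels, kernel_sizes):
    """Cascade layout: parallel 1 x k branches, followed by parallel k x 1 branches."""
    first_stage = sum(depthwise_weights(channels, 1, k) for k in kernel_sizes)
    second_stage = sum(depthwise_weights(channels, k, 1) for k in kernel_sizes)
    return first_stage + second_stage


# Hypothetical values chosen only for illustration (not from the paper).
CHANNELS = 64
KERNEL_SIZES = (11, 7, 3)

square = parallel_square_branches(CHANNELS, KERNEL_SIZES)
cascade = cascaded_separable_branches(CHANNELS, KERNEL_SIZES)

print(f"k x k branches:          {square} weights")
print(f"1 x k then k x 1 cascade: {cascade} weights")
print(f"cascade / square ratio:   {cascade / square:.3f}")
```

Each k x k kernel costs k^2 weights per channel, while the 1 x k and k x 1 pair costs 2k, so the gap widens as kernels grow. With stride 1 and unchanged output size, multiply-accumulate operations per output position scale the same way.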
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
