AudioInceptionNeXt: TCL AI LAB Submission to EPIC-SOUND Audio-Based-Interaction-Recognition Challenge 2023

Kin Wai Lau; Yasar Abbas Ur Rehman; Yuyang Xie; Lan Ma

arXiv:2307.07265 · cs.SD · July 17, 2023 · 2 cites

TL;DR

This paper introduces AudioInceptionNeXt, a single-stream CNN that applies parallel multi-scale depthwise separable convolutions to log-mel spectrograms. It reached 55.43% top-1 accuracy on the test set of the EPIC-SOUNDS Audio-Based Interaction Recognition Challenge and ranked 1st on the public leaderboard.

Contribution

The AudioInceptionNeXt block runs depthwise separable kernels of several sizes in parallel: large kernels capture long-duration activities and global frequency semantics, while small kernels capture short-duration activities and local details.

Findings

01. Achieved 55.43% top-1 accuracy on the challenge test set.
02. Ranked 1st on the public leaderboard.
03. Demonstrated the effectiveness of multi-scale depthwise separable convolutions.

Abstract

This report presents the technical details of our submission to the 2023 Epic-Kitchen EPIC-SOUNDS Audio-Based Interaction Recognition Challenge. The task is to learn the mapping from audio samples to their corresponding action labels. To achieve this goal, we propose a simple yet effective single-stream CNN-based architecture called AudioInceptionNeXt that operates on the time-frequency log-mel-spectrogram of the audio samples. Motivated by the design of the InceptionNeXt, we propose parallel multi-scale depthwise separable convolutional kernels in the AudioInceptionNeXt block, which enable the model to learn the time and frequency information more effectively. The large-scale separable kernels capture the long duration of activities and the global frequency semantic information, while the small-scale separable kernels capture the short duration of activities and local details of…
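To make the cost argument behind the parallel multi-scale separable kernels concrete, the sketch below counts convolution weights for three ways of covering a k x k window. It is an illustration under stated assumptions, not the authors' implementation: the channel width (64) and kernel sizes (3, 11, 21) are hypothetical values, and "separable" is read here as a depthwise 1 x k kernel along time paired with a depthwise k x 1 kernel along frequency, in the spirit of InceptionNeXt. Any pointwise (1 x 1) mixing layers are left out of the count.

```python
# Hypothetical settings, chosen only for illustration; not taken from the paper.
CHANNELS = 64
KERNEL_SIZES = (3, 11, 21)


def dense_conv_params(channels: int, k: int) -> int:
    """Weights in a standard k x k convolution mapping channels -> channels (no bias)."""
    return channels * channels * k * k


def depthwise_conv_params(channels: int, k: int) -> int:
    """Weights in a depthwise k x k convolution: one k x k filter per channel."""
    return channels * k * k


def separable_depthwise_params(channels: int, k: int) -> int:
    """Weights in a depthwise 1 x k (time) plus k x 1 (frequency) kernel pair."""
    return channels * (k + k)


def summarize(channels: int = CHANNELS, kernel_sizes=KERNEL_SIZES):
    """Return one row of weight counts per kernel size."""
    rows = []
    for k in kernel_sizes:
        rows.append({
            "k": k,
            "dense": dense_conv_params(channels, k),
            "depthwise": depthwise_conv_params(channels, k),
            "separable": separable_depthwise_params(channels, k),
        })
    return rows


if __name__ == "__main__":
    for row in summarize():
        print(row)
```

Under these assumed settings the separable pair grows linearly in k, whereas the dense and full depthwise kernels grow quadratically. That difference is what makes it affordable to run a large kernel next to a small one in the same block.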

Code & Models

Repositories

stevenlauhkhk/audioinceptionnext (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Human Pose and Action Recognition · Speech Recognition and Synthesis