Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks

Xian Li, Nian Shao, and Xiaofei Li

arXiv:2306.04186 · eess.AS · November 8, 2023 · 1 citation

TL;DR

This paper introduces ATST, a self-supervised Transformer-based audio learning framework that handles both clip-level and frame-level tasks and achieves state-of-the-art results, especially in sound event detection.

Contribution

The paper presents a novel Audio Teacher-Student Transformer framework with separate models for clip-level and frame-level tasks, employing specialized data augmentation strategies and knowledge distillation.

Findings

01. ATST-Frame achieves state-of-the-art performance on frame-level tasks.

02. Combining ATST-Clip and ATST-Frame improves downstream task results.

03. The models outperform previous methods, especially in sound event detection.

Abstract

Self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. One goal of audio self-supervised pre-training is to transfer knowledge to downstream audio tasks, generally including clip-level and frame-level tasks. While frame-level tasks are important for fine-grained acoustic scene/event understanding, prior studies primarily evaluate on clip-level downstream tasks. In order to tackle both clip-level and frame-level tasks, this paper proposes Audio Teacher-Student Transformer (ATST), with a clip-level version (named ATST-Clip) and a frame-level version (named ATST-Frame), responsible for learning clip-level and frame-level representations, respectively. Both methods use a Transformer encoder and a teacher-student training scheme. We have carefully designed the view creation strategy for ATST-Clip and ATST-Frame. Specifically, ATST-Clip uses…
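
The teacher-student training scheme mentioned in the abstract is often realized by letting the teacher's weights follow an exponential moving average (EMA) of the student's weights. The text above does not state the update rule, so the following is only a minimal sketch under that assumption; the function name `ema_update`, the `decay` parameter, and the dictionary-of-lists parameter layout are all hypothetical and chosen for illustration.

```python
from typing import Dict, List

# Hypothetical parameter container: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def ema_update(teacher: Params, student: Params, decay: float) -> Params:
    """Return new teacher weights: decay * teacher + (1 - decay) * student.

    Assumption: the teacher is an EMA of the student. This is a common
    choice in teacher-student self-supervised learning, not a rule taken
    from the abstract.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must be in [0, 1]")
    return {
        name: [
            decay * t + (1.0 - decay) * s
            for t, s in zip(teacher[name], student[name])
        ]
        for name in teacher
    }


# Toy usage: the teacher moves halfway toward the student.
student_params = {"w": [1.0, 2.0]}
teacher_params = {"w": [0.0, 0.0]}
teacher_params = ema_update(teacher_params, student_params, decay=0.5)
```

In a scheme of this kind only the student would receive gradient updates, with a decay close to 1 so that the teacher changes slowly and provides stable targets; the value 0.5 above is used only to keep the arithmetic easy to follow.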

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Dropout · Residual Connection · Linear Layer · Label Smoothing · Adam · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Layer Normalization