GCT: Gated Contextual Transformer for Sequential Audio Tagging

Yuanbo Hou; Yun Wang; Wenwu Wang; Dick Botteldooren

arXiv:2210.12541 · cs.SD · October 25, 2022

TL;DR

This paper introduces GCT, a gated contextual Transformer that combines forward-backward inference (FBI) with a gated contextual multi-layer perceptron (GCMLP) block, and reports higher sequential audio tagging accuracy than existing CTC-based methods.

Contribution

The paper proposes GCT, which pairs a GCMLP block with FBI to strengthen contextual modeling and to use both forward and backward information at inference in sequential audio tagging.

Findings

01. GCT outperforms CTC-based methods and cTransformer on real datasets.

02. GCT with GCMLP and FBI achieves higher accuracy in audio event sequence detection.

03. Manually annotated datasets are released to support future research.

Abstract

Audio tagging aims to assign predefined tags to audio clips to indicate the class information of audio events. Sequential audio tagging (SAT) means detecting both the class information of audio events and the order in which they occur within the audio clip. Most existing methods for SAT are based on connectionist temporal classification (CTC). However, CTC cannot effectively capture connections between events because it assumes that outputs at different time steps are conditionally independent. The contextual Transformer (cTransformer) addresses this issue by exploiting contextual information in SAT. Nevertheless, cTransformer is also limited in exploiting contextual information, as it uses only forward information in inference. This paper proposes a gated contextual Transformer (GCT) with forward-backward inference (FBI). In addition, a gated contextual multi-layer perceptron (GCMLP) block…
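To make the conditional independence assumption of the CTC baseline concrete, the following is a minimal sketch of greedy CTC decoding in plain Python. It is not the GCT method and is not taken from the paper or its repository. The blank index, the event names, and the per-frame probabilities are all hypothetical values chosen for illustration.

```python
from itertools import groupby

BLANK = 0  # assumption: index 0 is the CTC blank symbol


def ctc_greedy_decode(frame_probs, blank=BLANK):
    """Pick the most likely label per frame, then collapse repeats and drop blanks."""
    # Each frame is decided on its own; no frame sees the labels chosen elsewhere.
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    # Merge consecutive duplicates, then remove blank symbols.
    collapsed = [label for label, _ in groupby(best_path)]
    return [label for label in collapsed if label != blank]


def path_probability(frame_probs, path):
    """Probability of one frame-level path under CTC: a product of per-frame terms."""
    prob = 1.0
    for p, label in zip(frame_probs, path):
        prob *= p[label]  # factorises over time: the independence assumption
    return prob


# Hypothetical event vocabulary and per-frame posteriors (rows sum to 1).
EVENTS = {1: "dog_bark", 2: "door_slam"}
frame_probs = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.10, 0.80],
]

decoded = ctc_greedy_decode(frame_probs)
event_sequence = [EVENTS[i] for i in decoded]
print(event_sequence)
```

Because the path probability is a plain product of per-frame terms, the score of a later event cannot depend on which event was predicted earlier. This is the limitation that the abstract says contextual models such as cTransformer and GCT aim to address.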

Code & Models

Repositories

yuanbo2020/gct (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis