GCT: Gated Contextual Transformer for Sequential Audio Tagging
Yuanbo Hou, Yun Wang, Wenwu Wang, Dick Botteldooren

TL;DR
This paper introduces GCT, a gated contextual Transformer with forward-backward inference (FBI) and a gated contextual multi-layer perceptron (GCMLP) block, which improves sequential audio tagging accuracy over existing CTC-based methods.
Contribution
The paper proposes GCT, which adds a GCMLP block and FBI to strengthen contextual modeling and enable bidirectional inference for sequential audio tagging.
Findings
GCT outperforms CTC-based methods and cTransformer on real datasets.
GCT with GCMLP and FBI achieves higher accuracy in audio event sequence detection.
Manually annotated datasets are released to support future research.
Abstract
Audio tagging aims to assign predefined tags to audio clips to indicate the class information of audio events. Sequential audio tagging (SAT) means detecting both the class information of audio events and the order in which they occur within the audio clip. Most existing methods for SAT are based on connectionist temporal classification (CTC). However, CTC cannot effectively capture connections between events due to the conditional independence assumption between outputs at different times. The contextual Transformer (cTransformer) addresses this issue by exploiting contextual information in SAT. Nevertheless, cTransformer is also limited in exploiting contextual information as it only uses forward information in inference. This paper proposes a gated contextual Transformer (GCT) with forward-backward inference (FBI). In addition, a gated contextual multi-layer perceptron (GCMLP) block…
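To make the difference between audio tagging and SAT concrete, the following is a minimal sketch (not taken from the paper) of the greedy collapse step commonly used when decoding CTC outputs: consecutive repeated frame labels are merged and blank symbols are dropped, leaving an ordered event sequence. The event names, the blank symbol `_`, and the function names `ctc_collapse` and `clip_tags` are illustrative assumptions, and the frame labels are made-up example data rather than model output.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol; real CTC systems usually reserve an integer index


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels, then drop blanks (standard CTC collapse)."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


def clip_tags(event_sequence):
    """Plain audio tagging keeps only which classes occur, discarding their order."""
    return sorted(set(event_sequence))


# Hypothetical per-frame best labels for one clip (illustrative data only).
frames = ["_", "dog", "dog", "_", "speech", "speech", "_", "dog", "_"]

sequence = ctc_collapse(frames)
print(sequence)             # ['dog', 'speech', 'dog']  (SAT: classes and order)
print(clip_tags(sequence))  # ['dog', 'speech']         (audio tagging: classes only)
```

In this sketch each frame's label is picked on its own before the collapse, which mirrors the conditional independence assumption the abstract identifies as CTC's limitation; it does not attempt to reproduce cTransformer or GCT.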
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
