Audio Tagging With Connectionist Temporal Classification Model Using Sequential Labelled Data

Yuanbo Hou; Qiuqiang Kong; Shengchen Li

arXiv:1808.01935·cs.SD·August 7, 2018

TL;DR

This paper introduces a CRNN-CTC model for audio tagging that uses sequential labelled data to improve tagging accuracy and to predict the order of sound events, outperforming baseline CRNNs that use max or average pooling.

Contribution

It proposes using sequential labelled data with a CRNN-CTC framework for enhanced audio tagging and event order prediction.

Findings

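01. Achieved an AUC score of 0.986 in audio tagging, surpassing the baseline CRNN with Max Pooling (0.908) and with Average Pooling (0.815).

02. Demonstrated the model's ability to predict the order of sound events.

03. Outperformed existing methods based on weakly labelled data.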

Abstract

Audio tagging aims to predict one or several labels in an audio clip. Many previous works use weakly labelled data (WLD) for audio tagging, where only presence or absence of sound events is known, but the order of sound events is unknown. To use the order information of sound events, we propose sequential labelled data (SLD), where both the presence or absence and the order information of sound events are known. To utilize SLD in audio tagging, we propose a Convolutional Recurrent Neural Network followed by a Connectionist Temporal Classification (CRNN-CTC) objective function to map from an audio clip spectrogram to SLD. Experiments show that CRNN-CTC obtains an Area Under Curve (AUC) score of 0.986 in audio tagging, outperforming the baseline CRNN of 0.908 and 0.815 with Max Pooling and Average Pooling, respectively. In addition, we show CRNN-CTC has the ability to predict the order of…
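The abstract describes mapping an audio clip to an ordered sequence of sound events with a CTC objective. As a rough illustration of how frame-level CTC outputs could be turned into such a sequence, here is a minimal sketch of greedy (best-path) CTC decoding. It is not taken from the paper: the blank index, the event indices, the function names, and the toy probabilities are all assumptions made for the example.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical choice)


def ctc_greedy_decode(frame_probs: Sequence[Sequence[float]],
                      blank: int = BLANK) -> List[int]:
    """Best-path decoding: argmax per frame, merge repeats, drop blanks."""
    # Pick the most probable symbol in every frame.
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded: List[int] = []
    prev = None
    for k in best_path:
        # Keep a symbol only when it starts a new run and is not the blank.
        if k != prev and k != blank:
            decoded.append(k)
        prev = k
    return decoded


def tags_from_sequence(seq: Sequence[int], num_events: int) -> List[int]:
    """Presence/absence vector for events 1..num_events (order discarded)."""
    present = set(seq)
    return [1 if k in present else 0 for k in range(1, num_events + 1)]


# Toy input: 7 frames, 4 symbols (0 = blank, 1..3 = hypothetical event classes).
frame_probs = [
    [0.10, 0.80, 0.05, 0.05],
    [0.20, 0.70, 0.05, 0.05],
    [0.90, 0.05, 0.03, 0.02],
    [0.10, 0.10, 0.70, 0.10],
    [0.20, 0.10, 0.60, 0.10],
    [0.80, 0.10, 0.05, 0.05],
    [0.10, 0.60, 0.20, 0.10],
]

event_order = ctc_greedy_decode(frame_probs)        # [1, 2, 1]
clip_tags = tags_from_sequence(event_order, 3)      # [1, 1, 0]
print(event_order, clip_tags)
```

The decoded sequence keeps the order of events (event 1, then 2, then 1 again), which is the kind of information sequential labels carry; collapsing it to a presence/absence vector recovers only what weak labels would provide.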

Code & Models

Repositories

iooops/CS221-Audio-Tagging

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Average Pooling · Max Pooling
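For context on the two pooling methods listed here, the sketch below shows one common way they turn frame-level event probabilities into clip-level tag scores. It assumes pooling is applied over the time axis of a frames-by-events matrix; the function names and the toy values are hypothetical and are not drawn from the paper.

```python
from typing import List, Sequence


def max_pool(frame_probs: Sequence[Sequence[float]]) -> List[float]:
    """Clip-level score per event = highest frame-level probability."""
    return [max(col) for col in zip(*frame_probs)]


def avg_pool(frame_probs: Sequence[Sequence[float]]) -> List[float]:
    """Clip-level score per event = mean frame-level probability."""
    n_frames = len(frame_probs)
    return [sum(col) / n_frames for col in zip(*frame_probs)]


# Toy input: 4 frames, 2 hypothetical event classes.
frame_probs = [
    [0.75, 0.25],
    [0.25, 0.25],
    [0.50, 0.00],
    [0.50, 0.50],
]

print(max_pool(frame_probs))  # [0.75, 0.5]
print(avg_pool(frame_probs))  # [0.5, 0.25]
```

Both reductions discard when each event occurred, so neither can recover event order from the clip-level scores alone.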