Audio Tagging With Connectionist Temporal Classification Model Using Sequential Labelled Data
Yuanbo Hou, Qiuqiang Kong, Shengchen Li

TL;DR
This paper introduces a CRNN-CTC model for audio tagging that is trained on sequential labelled data, which records the order of sound events as well as their presence. The model predicts event order and outperforms baseline CRNNs that use max pooling or average pooling.
Contribution
It proposes sequential labelled data (SLD), in which both the presence and the order of sound events are known, and a CRNN trained with a CTC objective that uses SLD for audio tagging and event order prediction.
Findings
Achieved an AUC score of 0.986 in audio tagging, compared with 0.908 for the baseline CRNN with max pooling and 0.815 with average pooling.
Demonstrated the model's ability to predict event order.
Outperformed existing weakly labelled data methods.
Abstract
Audio tagging aims to predict one or several labels in an audio clip. Many previous works use weakly labelled data (WLD) for audio tagging, where only presence or absence of sound events is known, but the order of sound events is unknown. To use the order information of sound events, we propose sequential labelled data (SLD), where both the presence or absence and the order information of sound events are known. To utilize SLD in audio tagging, we propose a Convolutional Recurrent Neural Network followed by a Connectionist Temporal Classification (CRNN-CTC) objective function to map from an audio clip spectrogram to SLD. Experiments show that CRNN-CTC obtains an Area Under Curve (AUC) score of 0.986 in audio tagging, outperforming the baseline CRNN of 0.908 and 0.815 with Max Pooling and Average Pooling, respectively. In addition, we show CRNN-CTC has the ability to predict the order of…
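The difference between the pooling baselines and a CTC-style output can be illustrated with a small sketch. This is not the authors' implementation: the function names, the blank index, and the toy frame-level probabilities below are all assumptions made for illustration. The pooling functions reduce per-frame class scores to one clip-level score per class, which discards order, while a greedy CTC decode (best symbol per frame, merge repeats, drop blanks) yields an ordered label sequence from which the tags can also be read.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical convention)


def max_pool_tags(frame_probs):
    """Clip-level score per class: maximum over frames."""
    return [max(col) for col in zip(*frame_probs)]


def avg_pool_tags(frame_probs):
    """Clip-level score per class: mean over frames."""
    return [sum(col) / len(col) for col in zip(*frame_probs)]


def ctc_greedy_decode(frame_probs, blank=BLANK):
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    collapsed = [k for k, _ in groupby(best)]
    return [k for k in collapsed if k != blank]


# Toy frame-level scores for two event classes (made-up numbers).
tag_frames = [
    [0.1, 0.9],
    [0.2, 0.8],
    [0.9, 0.1],
    [0.0, 0.2],
]
clip_max = max_pool_tags(tag_frames)  # one score per class, order is lost
clip_avg = avg_pool_tags(tag_frames)

# Toy per-frame distributions over {blank, event 1, event 2} (made-up numbers).
ctc_frames = [
    [0.80, 0.10, 0.10],
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.20, 0.70],
    [0.60, 0.20, 0.20],
]
event_order = ctc_greedy_decode(ctc_frames)  # ordered sequence: [1, 2]
event_tags = sorted(set(event_order))        # tags implied by the sequence

print("max pooling:", clip_max)
print("avg pooling:", clip_avg)
print("decoded order:", event_order, "tags:", event_tags)
```

Greedy decoding is used here only because it is the simplest way to read a sequence out of frame-level outputs; the abstract does not state which decoding procedure the paper uses.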
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Average Pooling · Max Pooling
