Discrete Audio Representations for Automated Audio Captioning
Jingguang Tian, Haoqin Sun, Xinhui Hu, Xinkang Xu

TL;DR
This paper explores the use of discrete audio tokens for automated audio captioning (AAC), finding that tokenization degrades performance relative to direct use of continuous audio representations, and that a supervised tokenizer outperforms unsupervised ones.
Contribution
It introduces a supervised audio tokenizer trained with an audio tagging objective, enhancing semantic understanding for AAC.
Findings
Supervised audio tokens outperform unsupervised tokens in AAC.
Using continuous audio representations yields better AAC performance than discrete tokens.
The proposed tokenizer captures audio event information effectively.
Abstract
Discrete audio representations, termed audio tokens, are broadly categorized into semantic and acoustic tokens, typically generated through unsupervised tokenization of continuous audio representations. However, their applicability to automated audio captioning (AAC) remains underexplored. This paper systematically investigates the viability of audio token-driven models for AAC through comparative analyses of various tokenization methods. Our findings reveal that audio tokenization leads to performance degradation in AAC models compared to those that directly utilize continuous audio representations. To address this issue, we introduce a supervised audio tokenizer trained with an audio tagging objective. Unlike unsupervised tokenizers, which lack explicit semantic understanding, the proposed tokenizer effectively captures audio event information. Experiments conducted on the Clotho…
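To make "tokenization of continuous audio representations" concrete, here is a minimal sketch of one common way to discretize frame-level features: a nearest-neighbour lookup against a codebook. This is an illustrative assumption, not the paper's tokenizer; the abstract does not say how the supervised tokenizer assigns tokens, and the names `tokenize`, `codebook`, and `frames`, along with the toy values, are hypothetical.

```python
# Illustrative only: nearest-codebook quantization of continuous frames.
# The codebook and frame values are made-up toy data, not from the paper.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def tokenize(frames, codebook):
    """Map each continuous frame to the index of its nearest codebook vector."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]

# Hypothetical 3-entry codebook over 3-dimensional features.
codebook = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Four hypothetical frames of continuous audio features.
frames = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.0, 0.7],
    [0.2, 0.9, 0.1],
]

tokens = tokenize(frames, codebook)
print(tokens)
```

Each frame is replaced by a single integer, so any detail that distinguishes frames sharing a codebook entry is discarded. That loss is one plausible reason a token-driven captioning model could trail one that consumes the continuous features directly.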
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization
