Discrete Audio Representations for Automated Audio Captioning

Jingguang Tian; Haoqin Sun; Xinhui Hu; Xinkang Xu

arXiv:2505.14989 · cs.SD · May 22, 2025

Open Access

TL;DR

This paper explores discrete audio tokens for automated audio captioning (AAC). It finds that tokenization degrades performance relative to using continuous audio representations directly, and that a supervised tokenizer outperforms unsupervised ones.

Contribution

The paper introduces a supervised audio tokenizer trained with an audio tagging objective, so that the resulting tokens carry explicit audio event information for AAC.

Findings

01

Supervised audio tokens outperform unsupervised tokens in AAC.

02

Using continuous audio representations yields better AAC performance than discrete tokens.

03

The proposed tokenizer captures audio event information effectively.

Abstract

Discrete audio representations, termed audio tokens, are broadly categorized into semantic and acoustic tokens, typically generated through unsupervised tokenization of continuous audio representations. However, their applicability to automated audio captioning (AAC) remains underexplored. This paper systematically investigates the viability of audio token-driven models for AAC through comparative analyses of various tokenization methods. Our findings reveal that audio tokenization leads to performance degradation in AAC models compared to those that directly utilize continuous audio representations. To address this issue, we introduce a supervised audio tokenizer trained with an audio tagging objective. Unlike unsupervised tokenizers, which lack explicit semantic understanding, the proposed tokenizer effectively captures audio event information. Experiments conducted on the Clotho…
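To make "tokenization of continuous audio representations" concrete, here is a minimal sketch that assumes the simplest possible tokenizer: a nearest-neighbor lookup in a fixed codebook (plain vector quantization). The abstract does not describe the tokenizers it compares at this level of detail, so the codebook, the toy frames, and the function names below are all hypothetical and are not the authors' method. The sketch only shows why replacing each frame embedding with a token id discards information, which is one plausible reason token-driven models can trail models that consume continuous representations.

```python
def nearest_code(frame, codebook):
    """Return the index of the codebook vector closest to `frame` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((f - c) ** 2 for f, c in zip(frame, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def tokenize(frames, codebook):
    """Map a sequence of continuous frame embeddings to discrete token ids."""
    return [nearest_code(frame, codebook) for frame in frames]


def detokenize(tokens, codebook):
    """Look each token id back up in the codebook (a lossy reconstruction)."""
    return [codebook[token] for token in tokens]


def quantization_error(frames, tokens, codebook):
    """Mean squared distance between the frames and their codebook entries."""
    total = 0.0
    for frame, token in zip(frames, tokens):
        total += sum((f - c) ** 2 for f, c in zip(frame, codebook[token]))
    return total / len(frames)


# Hypothetical toy data: a 3-entry codebook and four 2-dimensional "frames".
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7], [0.2, 0.1]]

tokens = tokenize(frames, codebook)
error = quantization_error(frames, tokens, codebook)

print(tokens)     # [0, 1, 2, 0]
print(error > 0)  # True: the token sequence does not preserve the frames exactly
```

In this unsupervised setup the codebook is chosen without reference to any labels. A supervised tokenizer of the kind the abstract proposes would instead be trained with an audio tagging objective, which this sketch does not attempt to reproduce.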

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization