CLAP-ART: Automated Audio Captioning with Semantic-rich Audio Representation Tokenizer
Daiki Takeuchi, Binh Thien Nguyen, Masahiro Yasuda, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada

TL;DR
CLAP-ART is an automated audio captioning (AAC) method that takes semantic-rich discrete tokens, computed from pre-trained audio representations by vector quantization, as input; it outperforms the EnCLAP baseline on two AAC benchmarks.
Contribution
The paper proposes CLAP-ART, an AAC approach that uses semantic-rich discrete tokens derived from pre-trained audio representations, addressing the limitation that EnCodec tokens are designed to reconstruct waveforms rather than capture semantic context.
Findings
CLAP-ART outperforms EnCLAP on two AAC benchmarks.
Semantic-rich discrete tokens improve captioning accuracy.
Pre-trained audio representations enhance semantic capture.
Abstract
Automated Audio Captioning (AAC) aims to describe the semantic contexts of general sounds, including acoustic events and scenes, by leveraging effective acoustic features. To enhance performance, an AAC method, EnCLAP, employed discrete tokens from EnCodec as an effective input for fine-tuning the language model BART. However, EnCodec is designed to reconstruct waveforms rather than capture the semantic contexts of general sounds, which AAC should describe. To address this issue, we propose CLAP-ART, an AAC method that utilizes "semantic-rich and discrete" tokens as input. CLAP-ART computes semantic-rich discrete tokens from pre-trained audio representations (AR) through vector quantization. We experimentally confirmed that CLAP-ART outperforms the baseline EnCLAP on two AAC benchmarks, indicating that discrete tokens derived from semantically rich AR are beneficial for AAC.
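To make the vector quantization step concrete, here is a minimal sketch of how continuous feature frames can be mapped to discrete token indices by nearest-neighbour lookup in a codebook. It is an illustration only, not the authors' implementation: the function name `quantize`, the toy 3-dimensional codebook, and the feature values are all hypothetical, and the paper's actual quantizer, codebook size, and audio features are not specified here.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook vector.

    Plain nearest-neighbour vector quantization with squared Euclidean
    distance. Hypothetical helper for illustration, not the paper's code.
    """
    tokens = []
    for vec in features:
        # Squared distance from this frame to every codebook entry.
        dists = [sum((a - b) ** 2 for a, b in zip(vec, code)) for code in codebook]
        # The token is the index of the closest entry.
        tokens.append(dists.index(min(dists)))
    return tokens


# Toy codebook with three entries (assumed values, for illustration only).
codebook = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Three made-up "audio representation" frames, one per time step.
features = [
    [0.1, 0.9, 0.0],
    [0.8, 0.1, 0.1],
    [0.0, 0.2, 0.7],
]

tokens = quantize(features, codebook)
print(tokens)  # [1, 0, 2]
```

Each frame is replaced by a single integer, so a sequence of frames becomes a sequence of discrete tokens that a sequence model can consume as input.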
Taxonomy
Topics: Music and Audio Processing · Video Analysis and Summarization · Speech Recognition and Synthesis
Methods: Linear Layer · Multi-Head Attention · Attention Is All You Need · Dropout · Residual Connection · Byte Pair Encoding · Layer Normalization · Adam · Dense Connections · Softmax
