CLAP-ART: Automated Audio Captioning with Semantic-rich Audio Representation Tokenizer

Daiki Takeuchi; Binh Thien Nguyen; Masahiro Yasuda; Yasunori Ohishi; Daisuke Niizumi; Noboru Harada

arXiv:2506.00800 · eess.AS · June 3, 2025

TL;DR

CLAP-ART is an automated audio captioning method that takes semantic-rich discrete tokens, vector-quantized from pre-trained audio representations, as input and outperforms the EnCLAP baseline on two AAC benchmarks.

Contribution

The paper proposes CLAP-ART, an AAC approach that replaces EnCodec tokens, which are designed for waveform reconstruction, with semantic-rich discrete tokens computed from pre-trained audio representations.

Findings

01 CLAP-ART outperforms EnCLAP on two AAC benchmarks.

02 Semantic-rich discrete tokens improve captioning accuracy.

03 Pre-trained audio representations enhance semantic capture.

Abstract

Automated Audio Captioning (AAC) aims to describe the semantic contexts of general sounds, including acoustic events and scenes, by leveraging effective acoustic features. To enhance performance, a previous AAC method, EnCLAP, employed discrete tokens from EnCodec as an effective input for fine-tuning the language model BART. However, EnCodec is designed to reconstruct waveforms rather than to capture the semantic contexts of general sounds, which AAC should describe. To address this issue, we propose CLAP-ART, an AAC method that utilizes "semantic-rich and discrete" tokens as input. CLAP-ART computes semantic-rich discrete tokens from pre-trained audio representations (AR) through vector quantization. We experimentally confirmed that CLAP-ART outperforms the baseline EnCLAP on two AAC benchmarks, indicating that discrete tokens derived from semantically rich AR are beneficial for AAC.
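
The abstract states that discrete tokens are computed from pre-trained audio representations through vector quantization. As a minimal sketch of that general idea, and not the authors' implementation, the following assumes a single fixed codebook and toy 2-dimensional frame vectors. The names `quantize`, `codebook`, and `frames` are hypothetical, and the actual quantizer design, codebook size, and feature dimensions are not given in this summary.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames, codebook):
    """Map each frame vector to the index of its nearest codebook vector."""
    tokens = []
    for frame in frames:
        distances = [squared_distance(frame, code) for code in codebook]
        # The token is the position of the closest codebook entry.
        tokens.append(distances.index(min(distances)))
    return tokens


# Hypothetical toy codebook: four 2-dimensional code vectors (assumption).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical frame-level features standing in for a pre-trained
# audio representation (assumption; real features are much larger).
frames = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [1.1, 0.9]]

tokens = quantize(frames, codebook)
print(tokens)
```

Each frame maps to the index of its nearest codebook entry, so `tokens` is `[0, 1, 2, 3]` here. In an AAC system, an index sequence of this kind would serve as the discrete input to the language model.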

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Speech Recognition and Synthesis

Methods: Linear Layer · Multi-Head Attention · Attention Is All You Need · Dropout · Residual Connection · Byte Pair Encoding · Layer Normalization · Adam · Dense Connections · Softmax