Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning

Manh Luong; Khai Nguyen; Dinh Phung; Gholamreza Haffari; Lizhen Qu

arXiv:2502.05435·eess.AS·February 27, 2026

TL;DR

This paper introduces an unbiased sliced Wasserstein RBF kernel with rotary positional embedding for audio captioning, effectively capturing temporal relationships and improving caption quality, diversity, and reasoning in audio-language tasks.

Contribution

The paper proposes a novel USW-RBF kernel with rotary embeddings that preserves temporal information and enables efficient optimization for high-quality audio captioning and reasoning.

Findings

1. Significant improvement in caption quality and diversity on the AudioCaps and Clotho datasets.

2. Enhanced reasoning capabilities in large audio language models with the new kernel.

3. 4% increase in reasoning accuracy on the MMAU-test-mini benchmark.

Abstract

Audio captioning systems face a fundamental challenge: teacher-forcing training creates exposure bias that leads to caption degeneration during inference. While contrastive methods have been proposed as solutions, they typically fail to capture the crucial temporal relationships between acoustic and linguistic modalities. We address this limitation by introducing the unbiased sliced Wasserstein RBF (USW-RBF) kernel with rotary positional embedding, specifically designed to preserve temporal information across modalities. Our approach offers a practical advantage: the kernel enables efficient stochastic gradient optimization, making it computationally feasible for real-world applications. Building on this foundation, we develop a complete audio captioning framework that integrates stochastic decoding to further mitigate caption degeneration. Extensive experiments on AudioCaps and Clotho…
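To make the kernel idea concrete, here is a minimal Python sketch of how a sliced Wasserstein RBF kernel might be estimated by Monte Carlo sampling. It rests on assumptions that the abstract does not state: the kernel is taken to be the expectation, over random unit directions, of exp(-gamma * W_p^p) between the two projected point sets; both sets contain the same number of points, so the one-dimensional Wasserstein distance reduces to comparing sorted projections; and the rotary positional embedding is omitted. The function name `usw_rbf_kernel` and its parameters (`gamma`, `p`, `n_projections`, `rng`) are hypothetical and are not taken from the paper.

```python
import numpy as np


def usw_rbf_kernel(x, y, gamma=1.0, p=2, n_projections=64, rng=None):
    """Monte Carlo estimate of E_theta[exp(-gamma * W_p^p(theta#x, theta#y))].

    Illustrative sketch only: x and y are (n, d) arrays with the same n,
    and every name here is an assumption, not the paper's API.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 2 or x.shape != y.shape:
        raise ValueError("this sketch assumes two (n, d) arrays of equal shape")

    # Sample directions uniformly on the unit sphere by normalizing Gaussians.
    theta = rng.normal(size=(n_projections, x.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)

    # Project onto each direction and sort along the point axis: for
    # equal-size empirical measures, 1D optimal transport matches sorted values.
    x_proj = np.sort(x @ theta.T, axis=0)  # shape (n, n_projections)
    y_proj = np.sort(y @ theta.T, axis=0)

    # One W_p^p value per projection direction.
    w_pp = np.mean(np.abs(x_proj - y_proj) ** p, axis=0)

    # Average the exponentials across directions, not the distances.
    return float(np.mean(np.exp(-gamma * w_pp)))
```

Under the assumed definition, averaging the exponentials over sampled directions gives an unbiased estimate of the expectation. Exponentiating an averaged sliced Wasserstein estimate would not, because the exponential is nonlinear. That property is what would make the estimate compatible with stochastic gradient optimization.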

Videos

Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning · SlidesLive

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Radial Basis Function