Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning
Manh Luong, Khai Nguyen, Dinh Phung, Gholamreza Haffari, Lizhen Qu

TL;DR
This paper introduces an unbiased sliced Wasserstein RBF kernel with rotary positional embedding for audio captioning, effectively capturing temporal relationships and improving caption quality, diversity, and reasoning in audio-language tasks.
Contribution
The paper proposes a novel USW-RBF kernel with rotary embeddings that preserves temporal information and enables efficient optimization for high-quality audio captioning and reasoning.
Findings
Significant improvement in caption quality and diversity on AudioCaps and Clotho datasets.
Enhanced reasoning capabilities in large audio language models with the new kernel.
4% increase in reasoning accuracy on the MMAU-test-mini benchmark.
Abstract
Audio captioning systems face a fundamental challenge: teacher-forcing training creates exposure bias that leads to caption degeneration during inference. While contrastive methods have been proposed as solutions, they typically fail to capture the crucial temporal relationships between acoustic and linguistic modalities. We address this limitation by introducing the unbiased sliced Wasserstein RBF (USW-RBF) kernel with rotary positional embedding, specifically designed to preserve temporal information across modalities. Our approach offers a practical advantage: the kernel enables efficient stochastic gradient optimization, making it computationally feasible for real-world applications. Building on this foundation, we develop a complete audio captioning framework that integrates stochastic decoding to further mitigate caption degeneration. Extensive experiments on AudioCaps and Clotho…
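The abstract does not spell out the kernel, so the following is only a minimal sketch of one plausible reading. It assumes the kernel is the expectation, over random unit directions, of exp(-gamma * W_p^p) between the two sets of embeddings projected onto one dimension. Averaging over sampled directions then gives an unbiased Monte Carlo estimate of that expectation, which is what makes stochastic gradient optimization straightforward. The function names, the parameters (gamma, p, n_proj, base), and the restriction to two sequences of equal length are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def rotary_embed(x, base=10000.0):
    """Apply a rotary positional embedding to a (T, d) sequence, d even.

    Each consecutive feature pair is rotated by an angle proportional to
    the position index, so position is encoded without changing row norms.
    """
    x = np.asarray(x, dtype=float)
    T, d = x.shape
    if d % 2 != 0:
        raise ValueError("feature dimension must be even")
    freqs = base ** (-np.arange(0, d, 2) / d)          # (d/2,)
    angles = np.arange(T)[:, None] * freqs[None, :]    # (T, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


def usw_rbf_kernel(x, y, gamma=1.0, p=2, n_proj=64, rng=None):
    """Monte Carlo estimate of E_theta[exp(-gamma * W_p^p(theta#x, theta#y))].

    Assumption: x and y are (n, d) arrays with the same n, treated as
    uniform empirical measures, so each 1D Wasserstein distance reduces
    to comparing sorted projections.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("this sketch assumes x and y have the same shape")
    rng = np.random.default_rng(rng)
    d = x.shape[1]
    # Uniform directions on the unit sphere: normalized Gaussian samples.
    theta = rng.standard_normal((n_proj, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project, then sort along the sample axis for each direction.
    px = np.sort(x @ theta.T, axis=0)                  # (n, n_proj)
    py = np.sort(y @ theta.T, axis=0)
    w_pp = np.mean(np.abs(px - py) ** p, axis=0)       # W_p^p per direction
    # Averaging exp(.) per direction keeps the estimator unbiased, unlike
    # exponentiating a Monte Carlo estimate of the sliced distance.
    return float(np.mean(np.exp(-gamma * w_pp)))
```

In this sketch, hypothetical audio and text embeddings would each be passed through `rotary_embed` before calling `usw_rbf_kernel`, so that the compared point sets carry position information. Sequences of different lengths would need a general one-dimensional optimal transport solver in place of the sort-and-compare step.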
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Radial Basis Function
