How Contrastive Decoding Enhances Large Audio Language Models?

Tzu-Quan Lin; Wei-Ping Huang; Yi-Cheng Lin; Hung-yi Lee

arXiv:2603.09232 · cs.SD · March 11, 2026

Open Access

TL;DR

This paper systematically evaluates contrastive decoding strategies for large audio language models, introducing a framework to understand their effectiveness and guiding model selection based on error profiles.

Contribution

The paper identifies the most effective contrastive decoding strategies, introduces a Transition Matrix framework for analyzing how error patterns shift during inference, and offers guidelines for selecting models based on their error profiles.

Findings

01

Audio-Aware Decoding and Audio Contrastive Decoding are most effective.

02

Contrastive Decoding corrects errors in which models falsely claim that no audio is present or resort to uncertainty-driven guessing.

03

Contrastive Decoding does not correct flawed reasoning or confident misassertions.
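To make the idea of tracking error shifts concrete, here is a minimal sketch of how a transition matrix over error categories could be tallied. It is not the paper's implementation: the category labels, the function name `transition_matrix`, and the toy labels in `before` and `after` are all assumptions invented for illustration, and the numbers are not results from the paper.

```python
from collections import Counter

def transition_matrix(before, after, categories):
    """Count how often an example moves from category b (baseline decoding)
    to category a (contrastive decoding). Rows are 'before', columns 'after'."""
    counts = Counter(zip(before, after))
    return [[counts[(b, a)] for a in categories] for b in categories]

# Hypothetical category labels, loosely named after the error types
# mentioned in the findings; the paper's actual taxonomy may differ.
categories = ["correct", "no_audio_claim", "guess",
              "flawed_reasoning", "confident_error"]

# Toy per-example labels (made up for this sketch, not real results).
before = ["no_audio_claim", "guess", "flawed_reasoning",
          "correct", "confident_error", "no_audio_claim"]
after = ["correct", "correct", "flawed_reasoning",
         "correct", "confident_error", "correct"]

matrix = transition_matrix(before, after, categories)
for name, row in zip(categories, matrix):
    print(f"{name:>16}: {row}")
```

Reading along a row shows where examples of one baseline category end up after contrastive decoding. In this toy data, the `no_audio_claim` and `guess` rows move into the `correct` column while `flawed_reasoning` and `confident_error` stay on the diagonal, which mirrors the pattern the findings describe.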

Abstract

While Contrastive Decoding (CD) has proven effective at enhancing Large Audio Language Models (LALMs), the underlying mechanisms driving its success and the comparative efficacy of different strategies remain unclear. This study systematically evaluates four distinct CD strategies across diverse LALM architectures. We identify Audio-Aware Decoding and Audio Contrastive Decoding as the most effective methods. However, their impact varies significantly by model. To explain this variability, we introduce a Transition Matrix framework to map error pattern shifts during inference. Our analysis demonstrates that CD reliably rectifies errors in which models falsely claim an absence of audio or resort to uncertainty-driven guessing. Conversely, it fails to correct flawed reasoning or confident misassertions. Ultimately, these findings provide a clear guideline for determining which LALM…
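For readers unfamiliar with contrastive decoding, the sketch below shows the generic form of the idea: next-token scores computed with the audio input are contrasted against scores computed without it. This page does not give the formulas for the four strategies the paper evaluates, so this is an assumed, commonly used formulation rather than Audio-Aware Decoding or Audio Contrastive Decoding as the authors define them. The weight `alpha`, the toy vocabulary, and the logit values are all hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def contrastive_scores(logits_with_audio, logits_without_audio, alpha=1.0):
    """Assumed generic contrast:
    (1 + alpha) * log p(y | audio, text) - alpha * log p(y | text)."""
    lp_audio = log_softmax(logits_with_audio)
    lp_text = log_softmax(logits_without_audio)
    return [(1 + alpha) * a - alpha * t for a, t in zip(lp_audio, lp_text)]

def argmax(scores):
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)

# Toy three-token vocabulary and logits (hypothetical values).
vocab = ["dog", "cat", "silence"]
with_audio = [2.0, 1.0, 2.2]      # model conditioned on audio and text
without_audio = [0.5, 0.5, 3.0]   # same prompt with the audio removed

greedy = argmax(with_audio)
contrast = argmax(contrastive_scores(with_audio, without_audio, alpha=1.0))
print("greedy:", vocab[greedy])
print("contrastive:", vocab[contrast])
```

In this toy case the text-only prior strongly favors "silence", so subtracting it shifts the choice toward the token that the audio actually supports. Larger values of `alpha` penalize the text-only prior more heavily.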

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Computational and Text Analysis Methods