Temporal Contrastive Decoding: A Training-Free Method for Large Audio-Language Models
Yanda Li, Yuhan Liu, Zirui Song, Yunchao Wei, Martin Takáč, Salem Lahlou

TL;DR
Temporal Contrastive Decoding (TCD) is a training-free inference method that reduces temporal smoothing bias in large audio-language models, improving their ability to utilize transient acoustic cues.
Contribution
TCD is a contrastive decoding approach that improves unified LALMs without additional training and applies across model architectures.
Findings
TCD consistently improves performance on the MMAU and AIR-Bench benchmarks.
The method effectively mitigates temporal smoothing bias in LALMs.
Ablation studies confirm the importance of key TCD components.
Abstract
Large audio-language models (LALMs) generalize across speech, sound, and music, but unified decoders can exhibit a temporal smoothing bias: transient acoustic cues may be underutilized in favor of temporally smooth context that is better supported by language priors, leading to less specific audio-grounded outputs. We propose Temporal Contrastive Decoding (TCD), a training-free decoding method for unified LALMs that mitigates this effect at inference time. TCD constructs a temporally blurred slow-path view by smoothing the input waveform and re-encoding it, then contrasts next-token logits from the original and slow-path views. The contrastive signal is applied as a token-level logit update restricted to a small candidate set. A self-normalized stability score sets the blur window and update scale, and a step-wise gate based on uncertainty and audio reliance activates the…
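The two core steps described in the abstract (blurring the waveform, then contrasting next-token logits on a small candidate set) can be illustrated with a minimal sketch. This is not the authors' implementation: the moving-average blur, the top-k candidate rule, the additive update form, and the names `blur_waveform`, `contrastive_update`, `top_k`, and `alpha` are all assumptions made for illustration. The stability score and the step-wise gate are omitted because the text above does not define them.

```python
import numpy as np


def blur_waveform(wave, window):
    """Smooth a 1-D waveform with a moving average.

    Assumption: the abstract only says the input is "smoothed"; a
    moving average is one simple choice of temporal blur.
    """
    wave = np.asarray(wave, dtype=float)
    if window <= 1:
        return wave.copy()
    kernel = np.ones(window) / window
    # mode="same" keeps the output the same length as the input.
    return np.convolve(wave, kernel, mode="same")


def contrastive_update(logits_orig, logits_slow, top_k=5, alpha=1.0):
    """Contrast original-view and slow-path logits on a candidate set.

    Assumptions: candidates are the top_k tokens under the original
    view, and the update adds alpha * (orig - slow) to those tokens
    only. All other tokens keep their original logits.
    """
    logits_orig = np.asarray(logits_orig, dtype=float)
    logits_slow = np.asarray(logits_slow, dtype=float)
    candidates = np.argsort(logits_orig)[-top_k:]
    updated = logits_orig.copy()
    updated[candidates] += alpha * (
        logits_orig[candidates] - logits_slow[candidates]
    )
    return updated
```

In an actual pipeline, `logits_slow` would come from running the same model on the re-encoded output of `blur_waveform`. In this sketch, tokens whose score drops once transients are blurred away get boosted, and tokens that the blurred view supports just as well are left roughly unchanged.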
