Whisper-CD: Accurate Long-Form Speech Recognition using Multi-Negative Contrastive Decoding

Hoseong Ahn; Jeongyun Chae; Yoonji Park; Kyuhong Shim

arXiv:2603.06193·cs.SD·March 9, 2026

Open Access

TL;DR

Whisper-CD is a training-free contrastive decoding method for Whisper that improves long-form speech recognition accuracy by reducing hallucinations, repetition loops, and content omissions. It also generates tokens 48% faster than beam search and requires no retraining of the existing model.

Contribution

The paper introduces a multi-negative contrastive decoding framework that improves Whisper's long-form speech recognition purely at inference time.

Findings

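01 Reduces WER by up to 24.3 percentage points on CORAAL, one of five English long-form benchmarks evaluated.

02 Achieves 48% faster token generation throughput than beam search.

03 Operates purely at inference time, so it can be applied to an existing model without retraining.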

Abstract

Long-form speech recognition with large encoder-decoder models such as Whisper often exhibits hallucinations, repetition loops, and content omissions. These errors can accumulate and be further amplified when the previous segment's transcription is used as decoding context. We propose Whisper-CD, a training-free contrastive decoding framework that contrasts clean-audio logits against negative logits computed from three acoustically motivated perturbations: Gaussian noise injection, silence signal, and audio temporal shift. We aggregate these negatives via the log-sum-exp operator, building a unified multi-negative objective for token-by-token decoding. Across five English long-form benchmarks, Whisper-CD reduces WER by up to 24.3pp on CORAAL and shows 48% faster token generation throughput than beam search. Because Whisper-CD operates purely at inference time, it can be applied as a…
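
The abstract does not give the exact scoring rule, so the following is only a minimal Python sketch of one plausible reading of "contrast clean-audio logits against log-sum-exp aggregated negatives". The function names, the weight `alpha`, the `(1 + alpha) * clean - alpha * negative` form, the perturbation parameters, and the toy logits are all assumptions made for illustration, not the authors' implementation. In a real setup the negative logits would come from running the decoder on the perturbed audio; here they are hard-coded.

```python
import math
import random


def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def logsumexp(values):
    """Stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


# Three perturbations named in the abstract (parameters are hypothetical).
def add_gaussian_noise(audio, std, rng):
    return [x + rng.gauss(0.0, std) for x in audio]


def silence_like(audio):
    return [0.0] * len(audio)


def temporal_shift(audio, shift):
    """Delay the signal by `shift` samples, zero-padding and keeping length."""
    shift = min(shift, len(audio))
    return [0.0] * shift + audio[: len(audio) - shift]


def contrastive_scores(clean_logits, negative_logits_list, alpha=1.0):
    """Score each token by clean log-prob minus aggregated negative log-prob.

    Assumed form: (1 + alpha) * log p_clean - alpha * LSE_k(log p_neg_k).
    """
    clean_lp = log_softmax(clean_logits)
    neg_lps = [log_softmax(neg) for neg in negative_logits_list]
    scores = []
    for t in range(len(clean_lp)):
        aggregated = logsumexp([lp[t] for lp in neg_lps])
        scores.append((1.0 + alpha) * clean_lp[t] - alpha * aggregated)
    return scores


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


if __name__ == "__main__":
    rng = random.Random(0)
    audio = [0.1, -0.2, 0.3, 0.05]
    perturbed = [
        add_gaussian_noise(audio, 0.5, rng),
        silence_like(audio),
        temporal_shift(audio, 1),
    ]
    # Toy 3-token vocabulary. Token 0 stands in for a token the decoder
    # favors even when the audio is uninformative (e.g. a repetition).
    clean = [2.0, 1.9, 0.0]
    negatives = [[3.0, 0.0, 0.0] for _ in perturbed]
    print("greedy pick:", argmax(clean))
    print("contrastive pick:", argmax(contrastive_scores(clean, negatives)))
```

In this toy case, a token that the perturbed inputs all favor is penalized, so the pick can move to a token supported mainly by the clean audio. Because each negative is an extra forward pass over the same decoding step rather than an extra hypothesis to track, the negatives can in principle be batched with the clean input.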

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques