TL;DR
This paper introduces LLM-Codec, a neural audio codec trained with language model objectives to improve token predictability and semantic alignment, enhancing speech coherence and reducing perplexity.
Contribution
It proposes a training method for neural audio codecs that aligns them better with language models while leaving both the codec and the LLM architectures unchanged.
Findings
On the SALMon speech coherence task, token LMs trained on LLM-Codec reach 61.6% accuracy, a 12.1-point improvement over AUV.
Reduces perplexity by 35 on the SALMon speech coherence task.
Improves speech Mel distance by 5.0% on Codec-SUPERB-tiny.
Abstract
Neural audio codecs are widely used as tokenizers for spoken language models, but they are optimized for waveform reconstruction rather than autoregressive prediction. This mismatch injects acoustically driven uncertainty into the discrete token space and increases language-model perplexity. We propose LLM-Codec, which augments codec training with language-model-facing objectives while keeping both codec and LLM architectures unchanged. LLM-Codec introduces (i) future token prediction with Medusa-style multi-step heads to encourage multi-step predictability, and (ii) semantic alignment that matches audio and text representations via a memory-bank contrastive loss. A differentiable Gumbel bridge enables end-to-end gradients from these objectives to the codec encoder. On SALMon speech coherence, token LMs trained on LLM-Codec reach 61.6% accuracy (+12.1 points over AUV) while reducing perplexity 35.…
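The three ingredients named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`gumbel_softmax`, `multi_step_targets`, `info_nce`), the temperatures, the number of heads, and the choice of an InfoNCE-style form for the contrastive loss are all assumptions made for illustration, since the abstract does not specify them. Only forward computations are shown; in a real training setup an autodiff framework would carry gradients through the relaxed sample back to the codec encoder.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Relaxed sample over codebook entries: softmax((logits + Gumbel noise) / tau).

    Hypothetical helper; tau is an assumed temperature, not a value from the paper.
    """
    # Gumbel(0, 1) noise via inverse transform: g = -log(-log(u)), u in (0, 1).
    noisy = [
        (l - math.log(-math.log(rng.uniform(1e-9, 1.0 - 1e-9)))) / tau
        for l in logits
    ]
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def multi_step_targets(tokens, num_heads):
    """Medusa-style targets: head k (1-based) at position t predicts tokens[t + k].

    Returns {k: [(t, target_token), ...]}; num_heads is an assumed setting.
    """
    return {
        k: [(t, tokens[t + k]) for t in range(len(tokens) - k)]
        for k in range(1, num_heads + 1)
    }


def info_nce(audio_vec, text_vec, memory_bank, temperature=0.1):
    """InfoNCE-style loss (assumed form): paired text is the positive,
    memory-bank entries act as negatives."""

    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    logits = [cos(audio_vec, text_vec) / temperature]
    logits += [cos(audio_vec, neg) / temperature for neg in memory_bank]
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    # -log(exp(positive) / sum(exp(all))) = logsumexp(all) - positive
    return log_denominator - logits[0]


if __name__ == "__main__":
    probs = gumbel_softmax([2.0, 0.5, -1.0], tau=1.0, rng=random.Random(0))
    hard_index = probs.index(max(probs))  # discrete token a straight-through pass would emit
    print(probs, hard_index)
    print(multi_step_targets([5, 7, 9, 11], num_heads=2))
    print(info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]))
```

Lowering `tau` pushes the relaxed sample toward a one-hot vector, which is the usual trade-off between a faithful discrete token and a smoother gradient signal.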
