Exploring Token-Space Manipulation in Latent Audio Tokenizers
Francesco Paissan, Luca Della Libera, Mirco Ravanelli, Cem Subakan

TL;DR
This paper introduces LATTE, a novel latent audio tokenizer that enables global attribute manipulation in speech without supervision, while maintaining competitive low-bitrate reconstruction quality.
Contribution
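LATTE appends a fixed set of learnable latent tokens to the audio feature sequence and keeps only those tokens for quantization and decoding. The result is a compact, non-temporally aligned bottleneck in which each token can aggregate information across the full utterance, which makes global attributes editable directly in token space.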
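The append-then-retain idea can be illustrated with a toy sketch. This is not the LATTE architecture: the single unparameterized self-attention step stands in for whatever encoder the paper uses, the latent tokens are random rather than learned, and the names `latent_bottleneck`, `features`, and `latent_tokens` are hypothetical. It only shows that the retained output has a fixed size regardless of utterance length.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def latent_bottleneck(features, latent_tokens):
    """Toy stand-in for an encoder with appended latent tokens.

    features:      (T, D) frame-level audio features
    latent_tokens: (K, D) latent tokens (learned in a real model, random here)
    returns:       (K, D) the latent positions only, after one mixing step
    """
    # Append the latent tokens to the end of the feature sequence.
    seq = np.concatenate([features, latent_tokens], axis=0)   # (T + K, D)
    # One self-attention step without learned projections (an assumption):
    # every position, including each latent, can attend to the full utterance.
    scores = seq @ seq.T / np.sqrt(seq.shape[1])              # (T + K, T + K)
    mixed = softmax(scores, axis=-1) @ seq                    # (T + K, D)
    # Retain only the latent positions; the frame-level outputs are discarded.
    k = latent_tokens.shape[0]
    return mixed[-k:]

rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 16))                            # K = 8, D = 16
short = latent_bottleneck(rng.normal(size=(50, 16)), latents)
long = latent_bottleneck(rng.normal(size=(400, 16)), latents)
print(short.shape, long.shape)  # (8, 16) (8, 16)
```

In a real tokenizer the retained latents would then be quantized and passed to a decoder; both steps are omitted here.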
Findings
Swapping selected latent token positions between utterances can modify global attributes such as speaker identity and background noise.
LATTE achieves competitive reconstruction quality at low bitrates.
Token-space interventions enable controllable speech manipulation.
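The swap intervention described in the findings can be sketched as a simple index exchange between two token sequences. This is a minimal illustration under stated assumptions: the function name `swap_latent_tokens`, the token values, and the chosen positions are hypothetical, and which positions carry which attribute is something the paper determines empirically, not something this sketch knows.

```python
import numpy as np

def swap_latent_tokens(tokens_a, tokens_b, positions):
    """Exchange the tokens at `positions` between two utterances.

    tokens_a, tokens_b: integer arrays of latent token ids with the same shape,
                        indexed by latent position along the first axis
    positions:          list of latent positions to exchange (hypothetical choice)
    returns:            edited copies of both sequences; the inputs are not modified
    """
    a, b = tokens_a.copy(), tokens_b.copy()
    a[positions] = tokens_b[positions]
    b[positions] = tokens_a[positions]
    return a, b

# Placeholder token ids for two utterances with 8 latent positions each.
utt_a = np.arange(8)            # [0, 1, ..., 7]
utt_b = np.arange(100, 108)     # [100, 101, ..., 107]

edited_a, edited_b = swap_latent_tokens(utt_a, utt_b, positions=[2, 5])
print(edited_a.tolist())  # [0, 1, 102, 3, 4, 105, 6, 7]
print(edited_b.tolist())  # [100, 101, 2, 103, 104, 5, 106, 107]
```

The edited sequences would then go through the tokenizer's decoder to produce audio; that step depends on the trained model and is not shown.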
Abstract
Neural audio codecs provide compact discrete representations for speech generation and manipulation. However, most codecs organize tokens as frame-level sequences, making it difficult to study or intervene on global factors of variation. In this work, we propose the Latent Audio Tokenizer for Token-space Editing (LATTE) that appends a fixed set of learnable latent tokens to the audio feature sequence and retains only these tokens for quantization and decoding. This design produces a compact, non-temporally aligned bottleneck in which each token can aggregate global information across the full utterance. We show that the resulting tokenizer preserves competitive reconstruction quality in low-bitrate speech coding settings while enabling simple token-space interventions. In particular, we find that swapping selected latent token positions between utterances can modify global attributes,…
