TL;DR
This paper demonstrates that scaling transformer architectures with a Finite Scalar Quantization bottleneck can achieve state-of-the-art low-bitrate speech coding quality, surpassing existing methods in objective and subjective evaluations.
Contribution
It introduces a large-scale transformer model with an FSQ bottleneck for low-bitrate speech coding, reaching state-of-the-art speech quality at 400 or 700 bits per second.
Findings
Outperforms existing baselines in both objective and subjective tests
Achieves state-of-the-art speech quality at 400-700 bits/sec
Demonstrates effectiveness of scaled transformers in speech tokenization
Abstract
The tokenization of speech with neural audio codec models is a vital part of modern AI pipelines for the generation or understanding of speech, alone or in a multimodal context. Traditionally, such tokenization models have concentrated on low parameter-count architectures using only components with strong inductive biases. In this work we show that by scaling a transformer architecture with large parameter count to this problem, and applying a flexible Finite Scalar Quantization (FSQ) based bottleneck, it is possible to reach state-of-the-art speech quality at extremely low bit-rates of 400 or 700 bits-per-second. The trained models strongly outperform existing baselines in both objective and subjective tests.
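As a rough illustration of the FSQ idea named in the abstract, the sketch below bounds each latent dimension, rounds it to one of a small fixed number of levels, and packs the per-dimension codes into one token index. It follows the generic FSQ recipe, not necessarily the paper's exact bottleneck. The level counts, the latent vector, and the token rate are hypothetical values chosen for the example, and the helper names (`fsq_quantize`, `codes_to_index`, `bits_per_second`) are assumptions, not the authors' API.

```python
import math

def fsq_quantize(z, levels):
    """Quantize each latent value to one of levels[i] integer codes.

    Simplification (assumption): only odd level counts are handled, so the
    codes are symmetric around zero: {-half, ..., 0, ..., +half}.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch handles odd level counts only")
        half = (n_levels - 1) // 2
        bounded = math.tanh(value) * half   # squash into (-half, +half)
        codes.append(int(round(bounded)))   # snap to the nearest integer level
    return codes

def codes_to_index(codes, levels):
    """Pack per-dimension codes into a single token index (mixed radix)."""
    index, base = 0, 1
    for code, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        index += (code + half) * base       # shift code into 0..n_levels-1
        base *= n_levels
    return index

def bits_per_second(levels, tokens_per_second):
    """Bitrate = tokens per second * bits per token (no entropy coding)."""
    bits_per_token = sum(math.log2(n) for n in levels)
    return bits_per_token * tokens_per_second

# Hypothetical configuration: 3 latent dimensions with 5 levels each,
# giving an implicit codebook of 5 * 5 * 5 = 125 entries.
levels = [5, 5, 5]
z = [0.3, -2.0, 1.1]                        # made-up encoder output

codes = fsq_quantize(z, levels)             # [1, -2, 2]
index = codes_to_index(codes, levels)       # 103
rate = bits_per_second(levels, tokens_per_second=25)  # 25 is a made-up rate

print(codes, index, round(rate, 1))
```

Because every combination of levels is a valid code, the codebook is implicit and needs no learned embedding table. In training, the rounding step is typically paired with a straight-through gradient estimator, which this sketch omits.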
Peer Reviews
Decision: ICLR 2025 Poster
- The proposed system appears to work well according to the objective metrics and subjective tests.
- The proposed FSQ idea seems to be a solid quantization option, improving the codebook utilization.
- The authors put a lot of effort in making it more scalable by adding multiple levels of quantization.
- The proposed method relies on the dimension reduction part for its dimension-specific scalar quantization to work. And that's why they could achieve higher codebook utilization. Meanwhile, there is also a trend that higher codebook utilization leads to lower coding gain if entropy coding is applied after tokenization. Indeed, the paper does not mention anything about Huffman coding results, which the proposed method might not be able to take advantage of due to the low dimensionality and high
This paper is well written and very clear to follow. In the introduction part, it clearly presents the motivations and has an excellent survey of the existing methods. Though using transformers to scale and leverage FSQ for high codebook utilization is not something new, this paper presents the motivations of these changes, the associated challenges and their mitigations. This paper also introduces a new method so that FSQ can be used in a similar way as RVQ where a varying bits-per-second rat
If I understand the proposed model correctly, it is based on transformer layer with a local attention of 128 (both left and right), which means different from DAC/Encodec/Mimi etc which use causal encoders, the encoder in the proposed method is not causal, and it will introduce a latency up to the patch length (which is 320/16k ~ 20ms?). It would be great if the author can present the results with causal encoder so that it can be compared with DAC/Encodec/Mimi in a relative fair comparison (apar
- The idea of using Transformer and the main architecture for the neural audio codec learning is novel and well executed.
- Judging from the audio samples on the demo page and MOS study, TAAE is clearly state-of-the-art in low bit rate speech compression.
- This paper provided a lot of detailed knowledge, empirical findings, and engineering improvements that can truly benefit the audio codec research community. I personally learned a lot in the details such as the discussion on systematic bias o
Given the main contribution of this work is in exploring an alternative architecture for codec models, completeness in terms of design details and reproducibility are expected. In contrast, I found a lot of details missing or vague. (Although the authors state the code will be released later, the paper itself should still be comprehensive alone.) Here are some examples: --- > ($\S$2.1) ... Instead we raise the $\epsilon$ constant used in the calculation of normalization factors in the layer no
Taxonomy
Topics: Advanced Data Compression Techniques · Speech and Audio Processing · Speech Recognition and Synthesis
