Scaling Transformers for Low-Bitrate High-Quality Speech Coding

Julian D Parker; Anton Smirnov; Jordi Pons; CJ Carr; Zack Zukowski; Zach Evans; Xubo Liu

arXiv:2411.19842·eess.AS·December 2, 2024

TL;DR

This paper demonstrates that scaling transformer architectures with a Finite Scalar Quantization bottleneck can achieve state-of-the-art low-bitrate speech coding quality, surpassing existing methods in objective and subjective evaluations.

Contribution

It introduces a large-parameter-count transformer codec with a Finite Scalar Quantization (FSQ) bottleneck for low-bitrate speech coding, reaching state-of-the-art speech quality at 400 or 700 bits per second.

Findings

01

Outperforms existing baselines in both objective and subjective tests

02

Achieves state-of-the-art speech quality at 400 or 700 bits per second

03

Demonstrates effectiveness of scaled transformers in speech tokenization

Abstract

The tokenization of speech with neural audio codec models is a vital part of modern AI pipelines for the generation or understanding of speech, alone or in a multimodal context. Traditionally such tokenization models have concentrated on low parameter-count architectures using only components with strong inductive biases. In this work we show that by scaling a transformer architecture with large parameter count to this problem, and applying a flexible Finite Scalar Quantization (FSQ) based bottleneck, it is possible to reach state-of-the-art speech quality at extremely low bit-rates of $400$ or $700$ bits-per-second. The trained models strongly out-perform existing baselines in both objective and subjective tests.
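
The abstract names FSQ as the bottleneck but this page does not describe its configuration. As a rough illustration only: FSQ bounds each latent dimension and rounds it to one of a small, fixed number of levels, so the implicit codebook size is the product of the per-dimension level counts. The sketch below assumes NumPy, odd level counts, and a tanh bounding function. The function names, the level counts, and the token rate are hypothetical and are not taken from the paper, whose "flexible" FSQ variant may differ.

```python
import numpy as np


def fsq_quantize(z, levels):
    """Round each latent dimension to one of `levels[i]` uniform steps.

    Illustrative sketch only: odd level counts are assumed so that the
    grid is symmetric around zero. Not the paper's implementation.
    """
    z = np.asarray(z, dtype=np.float64)
    levels = np.asarray(levels, dtype=np.int64)
    if np.any(levels % 2 == 0) or np.any(levels < 3):
        raise ValueError("this sketch supports odd level counts >= 3 only")
    half = (levels - 1) // 2
    # Bound to [-1, 1], scale to the integer grid, round to the nearest step.
    steps = np.round(np.tanh(z) * half).astype(np.int64)
    indices = steps + half          # integer index in [0, levels[i] - 1]
    codes = steps / half            # quantized value in [-1, 1]
    return codes, indices


def bits_per_token(levels):
    """Bits needed to index the implicit codebook (product of level counts)."""
    return float(np.sum(np.log2(np.asarray(levels, dtype=np.float64))))


def bitrate_bps(levels, tokens_per_second):
    """Bitrate of a fixed-rate token stream, without entropy coding."""
    return bits_per_token(levels) * tokens_per_second


if __name__ == "__main__":
    hypothetical_levels = [5, 5, 5]      # assumption, chosen for illustration
    codes, indices = fsq_quantize([0.0, 10.0, -10.0], hypothetical_levels)
    print(codes, indices)
    print(bitrate_bps(hypothetical_levels, tokens_per_second=10))
```

Bitrate here is simply bits per token times tokens per second. Under the purely illustrative assumption of 25 tokens per second, for example, a 400 bits-per-second stream would carry 16 bits per token.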

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The proposed system appears to work well according to the objective metrics and subjective tests.
- The proposed FSQ idea seems to be a solid quantization option, improving the codebook utilization.
- The authors put a lot of effort in making it more scalable by adding multiple levels of quantization.

Weaknesses

- The proposed method relies on the dimension reduction step for its dimension-specific scalar quantization to work, which is why it achieves higher codebook utilization. Meanwhile, there is also a trend that higher codebook utilization leads to lower coding gain if entropy coding is applied after tokenization. Indeed, the paper does not mention anything about Huffman coding results, which the proposed method might not be able to take advantage of due to the low dimensionality and high

Reviewer 02 · Rating 8 · Confidence 5

Strengths

This paper is well written and very clear to follow. In the introduction part, it clearly presents the motivations and has an excellent survey of the existing methods. Though using transformers to scale and leverage FSQ for high codebook utilization is not something new, this paper presents the motivations of these changes, the associated challenges and their mitigations. This paper also introduces a new method so that FSQ can be used in a similar way as RVQ where a varying bits-per-second rat

Weaknesses

If I understand the proposed model correctly, it is based on transformer layers with a local attention of 128 (both left and right). This means that, unlike DAC/Encodec/Mimi etc., which use causal encoders, the encoder in the proposed method is not causal, and it will introduce a latency up to the patch length (which is 320/16k ~ 20ms?). It would be great if the authors could present results with a causal encoder so that it can be compared with DAC/Encodec/Mimi in a relatively fair comparison (apar

Reviewer 03 · Rating 8 · Confidence 5

Strengths

- The idea of using Transformer and the main architecture for the neural audio codec learning is novel and well executed.
- Judging from the audio samples on the demo page and MOS study, TAAE is clearly state-of-the-art in low bit rate speech compression.
- This paper provided a lot of detailed knowledge, empirical findings, and engineering improvements that can truly benefit the audio codec research community. I personally learned a lot in the details such as the discussion on systematic bias o

Weaknesses

Given the main contribution of this work is in exploring an alternative architecture for codec models, completeness in terms of design details and reproducibility are expected. In contrast, I found a lot of details missing or vague. (Although the authors state the code will be released later, the paper itself should still be comprehensive alone.) Here are some examples: --- > ($\S$2.1) ... Instead we raise the $\epsilon$ constant used in the calculation of normalization factors in the layer no

Code & Models

Repositories

Stability-AI/stable-codec
pytorch · Official

Models

Videos

Scaling Transformers for Low-Bitrate High-Quality Speech Coding · slideslive

Taxonomy

Topics: Advanced Data Compression Techniques · Speech and Audio Processing · Speech Recognition and Synthesis