Two-Dimensional Quantization for Geometry-Aware Audio Coding

Tal Shuster; Eliya Nachmani

arXiv:2512.01537·cs.SD·May 19, 2026

5 Reviews

TL;DR

This paper introduces Two-Dimensional Quantization (Q2D2), a novel geometric quantization scheme that enhances audio coding efficiency by better capturing feature correlations, leading to improved compression and reconstruction quality.

Contribution

Q2D2 employs structured 2D grids for feature quantization, improving codebook utilization and compression efficiency over traditional methods.

Findings

1. Q2D2 achieves competitive or superior reconstruction metrics.
2. It maintains low token rates and high codebook utilization.
3. Extensive experiments validate its effectiveness across speech, audio, and music.

Abstract

Recent neural audio codecs have achieved impressive reconstruction quality, typically relying on quantization methods such as Residual Vector Quantization (RVQ), Vector Quantization (VQ), and Finite Scalar Quantization (FSQ). However, these quantization techniques limit the geometric structure of the latent space and make it harder to capture correlations between features, leading to inefficiency in representation learning, codebook utilization, and token rate. In this paper, we introduce Two-Dimensional Quantization (Q2D2), a quantization scheme in which feature pairs are projected onto structured 2D grids, such as hexagonal, rhombic, or rectangular tilings, and quantized to the nearest grid values, yielding an implicit codebook defined by the product of grid levels, with codebook sizes comparable to conventional methods. Despite its simple geometric formulation, Q2D2 improves audio compression…
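The quantization step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`make_grid`, `quantize_pair`, `quantize_vector`, `codebook_size`), the `tanh` bounding of features, the staggered-row stand-in for a hexagonal tiling, and the brute-force nearest-point search are all assumptions made for the example.

```python
import math


def make_grid(levels_x, levels_y, staggered=False):
    """Build a list of 2D grid points inside [-1, 1] x [-1, 1].

    staggered=False gives a rectangular tiling; staggered=True shifts every
    other row by half a cell, a hexagonal-like layout (illustrative only).
    Both level counts must be at least 2.
    """
    # Horizontal spacing; the staggered layout leaves room for the half-cell shift.
    step = 2.0 / (levels_x - 0.5) if staggered else 2.0 / (levels_x - 1)
    points = []
    for j in range(levels_y):
        y = -1.0 + 2.0 * j / (levels_y - 1)
        shift = 0.5 * step if (staggered and j % 2 == 1) else 0.0
        for i in range(levels_x):
            points.append((-1.0 + i * step + shift, y))
    return points


def quantize_pair(u, v, grid):
    """Snap one feature pair to its nearest grid point; return (index, point)."""
    # Assumption: features are squashed into (-1, 1) before snapping.
    u, v = math.tanh(u), math.tanh(v)
    best = min(
        range(len(grid)),
        key=lambda k: (grid[k][0] - u) ** 2 + (grid[k][1] - v) ** 2,
    )
    return best, grid[best]


def codebook_size(grids):
    """Implicit codebook size: the product of the number of points per grid."""
    size = 1
    for grid in grids:
        size *= len(grid)
    return size


def quantize_vector(z, grids):
    """Quantize consecutive feature pairs of z and combine them into one token id."""
    indices = [
        quantize_pair(z[2 * p], z[2 * p + 1], grid)[0]
        for p, grid in enumerate(grids)
    ]
    # Mixed-radix encoding: each pair index is one digit of the token id.
    token = 0
    for idx, grid in zip(indices, grids):
        token = token * len(grid) + idx
    return token, indices


rect = make_grid(3, 3)
token, indices = quantize_vector([0.0, 0.0, 3.0, 3.0], [rect, rect])
print(token, indices, codebook_size([rect, rect]))
```

In this sketch no codebook is stored or learned: the set of codes is implied by the grid geometry, and two 3x3 grids already yield 81 distinct tokens. A trainable codec would additionally need a gradient path through the rounding step, which this example does not show.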

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- Proposes an interesting scalar quantization method, Q2D2, and uses it to train an audio tokenizer that shows competitive performance.

Weaknesses

## Experimental problems
- Although the proposed Q2D2 is inspired by FSQ, Q2D2 is never compared with FSQ.
  - It is impossible for a reader to know whether Q2D2 really improves on FSQ.
- Missing baselines
  - While the authors say WavTokenizer is the SOTA for single-layer audio tokenizers, BigCodec https://arxiv.org/abs/2409.05377 should be another important baseline.
  - StableCodec https://arxiv.org/abs/2411.19842v1, a single-layer audio tokenizer with FSQ, is also ignored in this paper.
  - XCo

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The idea of using two-dimensional quantization to improve the representational capacity of FSQ is novel and well-motivated.
2. The illustrations and explanations of the method, particularly the comparisons to VQ and FSQ, are clear and easy to follow.
3. The experiments on audio compression are concise and demonstrate the effectiveness of Q2D2 in improving speech coding performance at low token rates.

Weaknesses

1. The main weakness lies in the experimental design. Although the paper’s primary contribution is a new quantization approach, no experiments directly compare Q2D2 with VQ and FSQ under the same framework.
2. Q2D2 quantizes pairs of encoder output features using a fixed 2D grid, which is conceptually related to product quantization (PQ). The paper could clarify more explicitly how Q2D2 relates to VQ, FSQ, and PQ, highlighting similarities and differences.

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- The paper explores an under-examined direction of combining robustness and expressiveness in quantisation.
- The proposal to use rhombic grids is novel and potentially beneficial, as shown in the experimental results.
- The proposed method delivers results that are at least comparable to baseline models with relatively few tokens.

Weaknesses

Conceptual clarity and positioning
- The introduction claims the method “combines the robustness of FSQ with the expressive capacity of multi-dimensional grids,” but FSQ is missing from the benchmark, weakening the narrative.
Architecture and implementation details
- Architecture details are insufficient. It is unclear which parts are inherited from WavTokenizer and what is modified. Learning rate schedule is not specified.
- Training on a small dataset (LibriTTS) may reduce comparability wi

Reviewer 04 · Rating 6 · Confidence 5

Strengths

I find the Q2D2 discretization method to be intriguing. Although fundamentally similar to FSQ, the strategy of grouping channels into pairs for 2D plane quantization represents a novel and constructive modification. Therefore, my overall score is positive.

Weaknesses

1. Given that the core contribution of this work is the proposal of a novel quantization method (Q2D2), a standard and robust experimental configuration should include validation across broader domains (e.g., image, video, and general speech) to verify the resulting reconstruction quality and downstream generation performance.
2. The paper requires additional ablation studies. Specifically, a comparison between an FSQ-based WavTokenizer, the proposed Q2D2-based WavTokenizer, and a WavTokenizer

Reviewer 05 · Rating 2 · Confidence 4

Strengths

- The paper offers a clean, geometrically motivated quantization formulation. Pairwise 2D grids bridge the gap between FSQ’s stability and VQ’s expressiveness.
- The use of Straight-Through Estimators and lightweight projection layers makes the approach compatible with standard training pipelines. The method avoids extra losses or codebook-management tricks, which is a good simplification.
- Semantic-representation tests on the ARCH benchmark provide an initial indication that the learned codes

Weaknesses

- The model is trained solely on LibriTTS (500+ h), whereas major baselines (e.g., WavTokenizer, DAC, Encodec) use multi-domain datasets with speech, music, and general audio totaling over 8k hours. This discrepancy makes it difficult to isolate whether performance gains come from the proposed quantization scheme or from differences in data composition (especially training only on speech is easier than , normalization, and pre-training scope. The paper acknowledges this briefly but still draws s

Taxonomy

Topics: Speech Recognition and Synthesis · Advanced Data Compression Techniques · Speech and Audio Processing