SpecTokenizer: A Lightweight Streaming Codec in the Compressed Spectrum Domain
Zixiang Wan, Guochang Zhang, Yifeng He, Jianqiang Wei

TL;DR
SpecTokenizer is a lightweight streaming audio codec that operates in the compressed spectral domain. At 4 kbps it matches or exceeds a codec with a state-of-the-art lightweight architecture while using only 20% of the computation and 10% of the parameters.
Contribution
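The paper introduces SpecTokenizer, a lightweight streaming codec built solely from alternating CNN and RNN layers. It performs multi-scale modeling in the compressed spectral domain, and at 4 kbps it matches or outperforms the codec with the state-of-the-art lightweight architecture.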
Findings
At 4 kbps, achieves performance comparable or superior to the codec with the state-of-the-art lightweight architecture.
Requires only 20% of the computation and 10% of the parameters of that codec.
Outperforms codecs with similar resource usage in quality.
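To make the 4 kbps operating point concrete, the following is a minimal sketch of how the bitrate of a token-based codec is usually computed. It assumes a codebook-based quantizer that emits a fixed number of indices per frame; the summary does not state SpecTokenizer's quantizer, frame rate, or codebook size, so the function name `bitrate_bps` and every number below are hypothetical and chosen only so that the arithmetic lands on 4 kbps.

```python
import math


def bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second for a stream emitting `num_codebooks` indices per frame."""
    # Each index drawn from a codebook of size K costs log2(K) bits.
    bits_per_index = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_index


# Hypothetical configuration (not taken from the paper):
# 50 frames/s, 8 codebooks per frame, 1024 entries per codebook.
example = bitrate_bps(frame_rate_hz=50, num_codebooks=8, codebook_size=1024)
print(example)  # 4000.0 bits/s, i.e. 4 kbps
```

Under these assumptions, the same 4 kbps budget could also be met by trading frame rate against the number or size of codebooks, which is why a bitrate alone does not determine a codec's configuration.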
Abstract
Neural Audio Codecs (NACs) have gained growing attention in recent years as technologies for audio compression and audio representation in speech language models. While mainstream NACs typically require G-level computation and M-level parameters, the performance of lightweight and streaming NACs remains underexplored. This paper proposes SpecTokenizer, a lightweight streaming codec that operates in the compressed spectral domain. Composed solely of alternating CNN and RNN layers, SpecTokenizer achieves greater efficiency and better representational capability through multi-scale modeling in the compressed spectrum domain. At 4 kbps, the proposed SpecTokenizer achieves comparable or superior performance compared to the codec with state-of-the-art lightweight architecture while requiring only 20% of the computation and 10% of the parameters. Furthermore, it significantly outperforms the…
