# Entropy-based Coarse and Compressed Semantic Speech Representation Learning

**Authors:** Jialong Zuo, Guangyan Zhang, Minghui Fang, Shengpeng Ji, Xiaoqi Jiao, Jingyu Li, Yiwen Guo, Zhou Zhao

arXiv: 2509.00503 · 2025-09-03

## TL;DR

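This paper introduces an entropy-based dynamic aggregation method for learning compressed semantic speech representations. It reduces the redundancy of tokenizing speech at 25 or 50 tokens per second and improves downstream efficiency, while performing on par with or better than dense token sequences on ASR, speech-to-text translation, and voice conversion.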

## Contribution

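The paper proposes an entropy-driven aggregation framework. A speech language model pre-trained with next-token prediction supplies predictive entropy, which adaptively sets aggregation boundaries, and a cross-attention module fuses the information within each segment. The resulting representations are more compact while maintaining or improving task performance.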

## Key findings

- On ASR, compressed representations perform on par with or better than dense token sequences.
- Adjusting the entropy threshold gives flexible control over the granularity and compression ratio of the representations.
- The approach is also effective on speech-to-text translation and voice conversion.
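
As a rough illustration of how a threshold can trade granularity against compression ratio, the sketch below sweeps a threshold over made-up per-token entropy values. The boundary rule (open a new segment whenever entropy exceeds the threshold), the function names, and the numbers are assumptions for illustration, not details taken from the paper.

```python
def count_segments(entropies, threshold):
    """Count segments under an assumed rule: the first token opens a segment,
    and every later token whose entropy exceeds the threshold opens a new one."""
    return 1 + sum(1 for h in entropies[1:] if h > threshold)


def compression_ratio(entropies, threshold):
    """Input tokens per output segment (higher means stronger compression)."""
    return len(entropies) / count_segments(entropies, threshold)


# Hypothetical per-token predictive entropies (in nats) for a 10-token utterance.
entropies = [2.1, 0.3, 0.2, 1.8, 0.4, 0.1, 0.2, 2.5, 0.6, 0.3]

for threshold in (0.25, 0.5, 1.0, 2.0):
    ratio = compression_ratio(entropies, threshold)
    print(f"threshold={threshold:.2f}  compression ratio={ratio:.2f}")
```

Under this assumed rule, raising the threshold produces fewer, longer segments, so the compression ratio can only stay the same or grow.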

## Abstract

Discrete speech representation learning has recently attracted increasing interest in both acoustic and semantic modeling. Existing approaches typically encode 16 kHz waveforms into discrete tokens at a rate of 25 or 50 tokens per second. However, given that speech generally conveys only 2 to 5 words per second, such fine-grained tokenization introduces redundancy and hinders efficiency in downstream training and inference. Moreover, semantic speech representations at this frequency primarily capture phonetic-level information, while semantic understanding may not require such detailed token-level resolution. To address these limitations, we propose an entropy-based dynamic aggregation framework for learning compressed semantic speech representations. A speech language model is first pre-trained via next-token prediction on large-scale unlabeled data to capture frequent token patterns. Predictive entropy is then used to adaptively determine aggregation boundaries, followed by a cross-attention module that fuses information within each segment. By adjusting the entropy threshold, the granularity and compression ratio of the representations can be flexibly controlled. Experiments on ASR, speech-to-text translation, and voice conversion tasks show that the compressed representations perform on par with or better than dense token sequences, demonstrating the effectiveness of the proposed approach.
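
The pipeline described in the abstract (next-token logits, predictive entropy, entropy-based boundaries, attention-based fusion per segment) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the boundary rule (a token whose entropy exceeds the threshold opens a new segment), the use of the segment mean as the attention query, the absence of learned projection weights, and all function names are hypothetical choices made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def predictive_entropy(logits):
    """Shannon entropy (nats) of the next-token distribution given by logits."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def segment_by_entropy(entropies, threshold):
    """Assumed rule: a token with entropy above the threshold opens a new segment."""
    segments, current = [], [0]
    for i in range(1, len(entropies)):
        if entropies[i] > threshold:
            segments.append(current)
            current = [i]
        else:
            current.append(i)
    segments.append(current)
    return segments


def attention_pool(frames):
    """Fuse one segment with single-query scaled dot-product attention.
    The query is the segment mean (an assumption; no learned weights here)."""
    dim = len(frames[0])
    query = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    scores = [sum(q * x for q, x in zip(query, f)) / math.sqrt(dim) for f in frames]
    weights = softmax(scores)
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


def compress(features, lm_logits, threshold):
    """features[i] is the vector for token i; lm_logits[i] are the language
    model's logits for predicting token i from the preceding tokens."""
    entropies = [predictive_entropy(row) for row in lm_logits]
    segments = segment_by_entropy(entropies, threshold)
    return [attention_pool([features[i] for i in seg]) for seg in segments]


# Toy inputs: flat logits mean an uncertain prediction, peaked logits a confident one.
flat, peaked = [0.0, 0.0, 0.0, 0.0], [8.0, 0.0, 0.0, 0.0]
lm_logits = [flat, peaked, peaked, flat, peaked, peaked]
features = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]

compressed = compress(features, lm_logits, threshold=1.0)
print(len(features), "tokens ->", len(compressed), "segments")
```

In this toy input the two uncertain positions open the two segments, so six token vectors are fused into two. In a real system the logits would come from the pre-trained speech language model and the fusion module would be trained, neither of which this sketch attempts.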

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00503/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/2509.00503/full.md

## References

37 references — full list in the complete paper: https://tomesphere.com/paper/2509.00503/full.md

---
Source: https://tomesphere.com/paper/2509.00503