Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners
Ze Yuan, Yanqing Liu, Shujie Liu, Sheng Zhao

TL;DR
This paper introduces Flow-Omni, a continuous speech token model that enhances multi-modality learning in LLMs, improving robustness and real-time speech interaction by replacing discrete tokens with continuous representations.
Contribution
The paper proposes a novel continuous speech token approach that uses a flow matching loss, enabling more robust multi-modality learning in LLMs than traditional discrete token methods.
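As a rough illustration of what a flow matching loss computes, the sketch below evaluates one sample of a conditional flow matching objective on a straight noise-to-data path. This summary does not give the paper's exact formulation, so the linear path, the use of a plain feature vector as the continuous speech frame, and the names `flow_matching_loss`, `predict_velocity`, and `zero_velocity` are all assumptions made for illustration, not the authors' implementation.

```python
import random


def flow_matching_loss(predict_velocity, target_frame, rng):
    """One-sample flow matching loss on a linear noise-to-data path (assumed form)."""
    dim = len(target_frame)
    # x0: Gaussian noise with the same dimension as the target frame.
    noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # t: time drawn uniformly from [0, 1).
    t = rng.random()
    # x_t: point on the straight line from noise (t = 0) to data (t = 1).
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, target_frame)]
    # The velocity of that line is constant: data minus noise.
    target_velocity = [x - n for n, x in zip(noise, target_frame)]
    predicted = predict_velocity(x_t, t)
    # Mean squared error between predicted and target velocity.
    return sum((p - u) ** 2 for p, u in zip(predicted, target_velocity)) / dim


def zero_velocity(x_t, t):
    # Hypothetical stand-in for a learned network; in a real system this
    # would also be conditioned on the language model's hidden state.
    return [0.0] * len(x_t)


# Hypothetical 4-dimensional continuous speech frame.
frame = [0.5, -1.0, 0.25, 2.0]
loss = flow_matching_loss(zero_velocity, frame, random.Random(0))
print(loss)
```

Because the regression target is a real-valued vector rather than a codebook index, an objective of this kind needs no quantizer, which is the contrast with discrete token methods drawn above.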
Findings
Continuous speech tokens improve robustness over discrete tokens.
Flow-Omni achieves low-latency, real-time speech interaction.
Continuous tokens mitigate the representation loss associated with discrete tokens.
Abstract
Recent advances in GPT-4o-like multi-modality models have demonstrated remarkable progress in direct speech-to-speech conversation, with real-time speech interaction and strong speech understanding. However, current research focuses on discrete speech tokens that align with discrete text tokens for language modelling, which depends on an audio codec with residual connections or independent group tokens. Such a codec usually requires training on large-scale, diverse datasets to ensure that the discrete speech codes reconstruct data well across varied domains, noise conditions, and styles, and it also requires a well-designed quantizer and encoder-decoder architecture for discrete-token language modelling. This paper introduces Flow-Omni, a continuous-speech-token-based, GPT-4o-like model capable of real-time speech interaction with low streaming latency. Specifically,…
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Speech and dialogue systems
