FastSLM: Hierarchical Frame Q-Former for Effective Speech Modality Adaptation
Junseok Lee, Sangyong Lee, Chang-Jae Chun

TL;DR
FastSLM introduces a hierarchical, token-efficient speech processing architecture that cuts input tokens by 93% relative to frame-level methods while maintaining competitive performance, enabling scalable, real-time long-form speech understanding with lower computational costs.
Contribution
The paper proposes FastSLM with a Hierarchical Frame Querying Transformer that compresses speech representations across multiple scales, addressing the scalability challenge in long-form speech modeling for LLMs.
Findings
Reduces speech representation rate to 1.67 tokens/sec
Achieves 93% token reduction compared to frame-level methods
Maintains competitive performance on long-form benchmarks
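To make the two reported figures concrete, the following is a minimal sketch of the token-budget arithmetic. The function names are hypothetical, and the frame-level baseline rate is not stated above: it is inferred here on the assumption that the 93% reduction is measured on tokens per second. Because both figures are rounded, the result is approximate.

```python
def speech_tokens(duration_sec: float, tokens_per_sec: float = 1.67) -> float:
    """Approximate number of speech tokens passed to the LLM for a clip."""
    return duration_sec * tokens_per_sec


def implied_baseline_rate(compressed_rate: float = 1.67, reduction: float = 0.93) -> float:
    """Frame-level token rate implied by the two reported figures.

    Assumption: 'reduction' is measured on tokens per second, so
    compressed_rate = baseline_rate * (1 - reduction).
    """
    return compressed_rate / (1.0 - reduction)


one_hour = speech_tokens(3600)      # one hour of audio
baseline = implied_baseline_rate()  # inferred, not reported in the source
print(f"1 h of speech: about {round(one_hour)} tokens")
print(f"implied frame-level rate: about {baseline:.1f} tokens/sec")
```

Under these assumptions, an hour of speech fits in a few thousand tokens, which is the scale at which long-form audio becomes practical inside an LLM context window.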
Abstract
Although Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision, language, and video understanding tasks, scaling them to long-form speech remains a critical bottleneck due to the explosive growth of input tokens. Existing speech-language models typically project high-frame-rate acoustic features directly into the LLM input space, rendering long-context processing computationally prohibitive as audio duration increases. In this paper, we present FastSLM, a token-efficient architecture designed to overcome this scalability limit through extreme temporal compression. At its core is the Hierarchical Frame Querying Transformer (HFQ-Former), which progressively distills local acoustic details into compact, semantically rich representations across multiple temporal scales. This hierarchical abstraction reduces the speech representation rate to just 1.67…
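The idea of progressively compressing frames across temporal scales can be illustrated with a toy sketch. This is not the HFQ-Former itself, whose details are not given in the abstract above. It assumes each stage replaces every non-overlapping window of frames with one vector produced by single-head dot-product attention from a query vector. The window sizes (3 and 5), the feature dimension, and the random queries (standing in for learned ones) are all placeholders chosen for illustration.

```python
import math
import random


def attend(query, frames):
    """Single-head dot-product attention of one query over a window of frames."""
    d = len(query)
    scores = [sum(q * f for q, f in zip(query, fr)) / math.sqrt(d) for fr in frames]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * fr[i] for w, fr in zip(weights, frames)) / z for i in range(d)]


def compress_stage(frames, window, query):
    """Replace each non-overlapping window of frames with one attended vector."""
    return [attend(query, frames[i:i + window]) for i in range(0, len(frames), window)]


def hierarchical_compress(frames, windows, queries):
    """Apply the stages in order; each stage shortens the sequence further."""
    for window, query in zip(windows, queries):
        frames = compress_stage(frames, window, query)
    return frames


random.seed(0)
dim = 4                                   # hypothetical feature dimension
frames = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(150)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(2)]  # stand-ins for learned queries
compressed = hierarchical_compress(frames, windows=(3, 5), queries=queries)
print(len(frames), "->", len(compressed))
```

With these placeholder windows the sequence shrinks by a factor of 3 and then 5. A real model would use learned queries, multi-head attention, and feed-forward layers at each stage.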
Taxonomy
Topics: Speech Recognition and Synthesis · Multimodal Machine Learning Applications · Speech and Audio Processing
