FastSLM: Hierarchical Frame Q-Former for Effective Speech Modality Adaptation

Junseok Lee; Sangyong Lee; Chang-Jae Chun

arXiv:2601.06199 · eess.AS · February 3, 2026

Open Access

TL;DR

FastSLM introduces a hierarchical, token-efficient speech processing architecture that cuts the speech input to 1.67 tokens/sec (a 93% reduction compared to frame-level methods) while remaining competitive on long-form benchmarks, enabling scalable, real-time long-form speech understanding at lower computational cost.

Contribution

The paper proposes FastSLM with a Hierarchical Frame Querying Transformer that compresses speech representations across multiple scales, addressing the scalability challenge in long-form speech modeling for LLMs.
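To make the idea of multi-scale compression concrete, here is a minimal sketch in plain Python. It is not the paper's HFQ-Former: the single query per window, the window sizes `[5, 3]`, the random vectors standing in for learned queries, and the function names `query_pool` and `hierarchical_compress` are all assumptions made for illustration. A real Q-Former would also use learned projections and multiple attention heads, which are omitted here.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_pool(frames, query, window):
    """Compress each non-overlapping window of `frames` into one vector
    by attending a single query vector over the frames in that window.
    (Hypothetical helper; a trailing partial window is pooled as-is.)"""
    dim = len(query)
    scale = math.sqrt(dim)
    pooled = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        # Scaled dot-product scores between the query and each frame.
        scores = [sum(q * x for q, x in zip(query, f)) / scale for f in chunk]
        weights = softmax(scores)
        # Attention-weighted average of the frames in the window.
        pooled.append(
            [sum(w * f[i] for w, f in zip(weights, chunk)) for i in range(dim)]
        )
    return pooled


def hierarchical_compress(frames, queries, windows):
    """Apply query pooling stage by stage, so each stage summarizes
    the output of the previous one at a coarser temporal scale."""
    tokens = frames
    for query, window in zip(queries, windows):
        tokens = query_pool(tokens, query, window)
    return tokens


rng = random.Random(0)
DIM = 8  # illustrative feature size, not from the paper
# Assumption: 30 s of audio at a hypothetical 25 frames/sec encoder rate.
frames = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(750)]
# Random vectors stand in for learned queries (one per stage).
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(2)]

tokens = hierarchical_compress(frames, queries, windows=[5, 3])
print(len(frames), "->", len(tokens))  # 750 -> 50
```

The window sizes were picked only so that the hypothetical 25 frames/sec input ends up at 50 tokens for 30 seconds, close to the 1.67 tokens/sec rate the paper reports. The paper's actual stage count and window configuration are not stated on this page.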

Findings

01. Reduces speech representation rate to 1.67 tokens/sec

02. Achieves 93% token reduction compared to frame-level methods

03. Maintains competitive performance on long-form benchmarks
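The first two findings can be turned into a rough token budget. The sketch below assumes that the 1.67 tokens/sec rate and the 93% reduction describe the same comparison, which implies a frame-level baseline of roughly 1.67 / (1 - 0.93) ≈ 23.9 tokens/sec. That baseline is inferred here, not quoted from the paper, and the name `token_budget` is hypothetical.

```python
COMPRESSED_RATE = 1.67  # tokens/sec, from the findings above
REDUCTION = 0.93        # fractional token reduction, from the findings above

# Assumption: both figures refer to the same baseline, so the implied
# frame-level rate is the compressed rate divided by the kept fraction.
IMPLIED_BASELINE_RATE = COMPRESSED_RATE / (1.0 - REDUCTION)


def token_budget(duration_sec):
    """Return approximate LLM input token counts for a clip of the
    given duration, with and without the reported compression."""
    return {
        "compressed": duration_sec * COMPRESSED_RATE,
        "frame_level": duration_sec * IMPLIED_BASELINE_RATE,
    }


one_hour = token_budget(3600)
print(round(one_hour["compressed"]))   # 6012
print(round(one_hour["frame_level"]))  # 85886
```

Under these assumptions, an hour of speech costs about 6,000 input tokens instead of roughly 86,000, which is the scalability gap the paper targets.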

Abstract

Although Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision, language, and video understanding tasks, scaling them to long-form speech remains a critical bottleneck due to the explosive growth of input tokens. Existing speech-language models typically project high-frame-rate acoustic features directly into the LLM input space, rendering long-context processing computationally prohibitive as audio duration increases. In this paper, we present FastSLM, a token-efficient architecture designed to overcome this scalability limit through extreme temporal compression. At its core is the Hierarchical Frame Querying Transformer (HFQ-Former), which progressively distills local acoustic details into compact, semantically rich representations across multiple temporal scales. This hierarchical abstraction reduces the speech representation rate to just 1.67…

Taxonomy

Topics: Speech Recognition and Synthesis · Multimodal Machine Learning Applications · Speech and Audio Processing