Windowed SummaryMixing: An Efficient Fine-Tuning of Self-Supervised Learning Models for Low-resource Speech Recognition

Aditya Srinivas Menon; Kumud Tripathi; Raj Gohil; Pankaj Wasnik

arXiv:2602.09043 · eess.AS · February 11, 2026

TL;DR

This paper introduces Windowed SummaryMixing, a linear-time method that enhances self-supervised speech models with local context, enabling efficient fine-tuning for low-resource speech recognition with reduced memory usage.

Contribution

The paper proposes Windowed SummaryMixing (WSM), a novel local context integration method, and a selective fine-tuning strategy that improves efficiency and performance in low-resource SSL speech models.

Findings

01. WSM improves ASR performance over global summary methods.

02. WSM reduces peak VRAM usage by 40% in the SSL models.

03. WSM maintains linear-time complexity with enhanced context awareness.

Abstract

Self-supervised learning (SSL) has advanced speech processing but suffers from quadratic complexity due to self-attention. To address this, SummaryMixing (SM) has been proposed as a linear-time alternative that summarizes entire utterances using mean pooling but lacks sufficient local context. In this work, we introduce Windowed SummaryMixing (WSM), which enhances SM by integrating local neighborhood summaries alongside the global summary, maintaining efficiency while improving temporal dependencies. Additionally, we introduce a selective fine-tuning approach, replacing self-attention layers in SSL models with WSM blocks and fine-tuning only these blocks in low-resource settings. Our approach improves ASR performance while reducing peak VRAM usage by 40% in the SSL models. WSM blocks have linear-time complexity with enhanced context awareness. Selectively replacing some attention…
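
The pooling idea in the abstract (a global mean over the utterance plus a mean over each frame's local neighborhood) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the symmetric window with `half_width` frames on each side, the edge clipping, and the concatenation used to combine the three vectors are all assumptions made for illustration, and the learned projections a real block would apply are omitted. The sketch only shows that both summaries can be computed in time linear in the number of frames, here by using prefix sums.

```python
from typing import List

Frame = List[float]


def global_summary(frames: List[Frame]) -> Frame:
    """Mean-pool every frame of the utterance into a single vector."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]


def windowed_summaries(frames: List[Frame], half_width: int) -> List[Frame]:
    """Mean of the frames in [t - half_width, t + half_width], clipped at the edges.

    Prefix sums keep the cost at O(T * dim), independent of the window size.
    The symmetric window and the clipping rule are illustrative assumptions.
    """
    n = len(frames)
    dim = len(frames[0])
    # prefix[t] holds the element-wise sum of frames[0:t].
    prefix = [[0.0] * dim]
    for f in frames:
        prefix.append([p + x for p, x in zip(prefix[-1], f)])
    out = []
    for t in range(n):
        lo = max(0, t - half_width)
        hi = min(n, t + half_width + 1)
        count = hi - lo
        out.append([(prefix[hi][d] - prefix[lo][d]) / count for d in range(dim)])
    return out


def mix(frames: List[Frame], half_width: int) -> List[Frame]:
    """Concatenate each frame with its local summary and the global summary.

    Concatenation is a stand-in (assumption) for whatever learned combiner
    a real WSM block uses.
    """
    g = global_summary(frames)
    local = windowed_summaries(frames, half_width)
    return [f + l + g for f, l in zip(frames, local)]


if __name__ == "__main__":
    toy = [[1.0], [2.0], [3.0], [4.0]]
    for row in mix(toy, half_width=1):
        print(row)
```

In this toy run each output row holds the frame value, its neighborhood mean, and the utterance mean. Every frame sees the same global vector, while the local vector changes from frame to frame, which is the extra context the abstract attributes to WSM.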

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research