WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models

Hanna Lee; Tan Dat Nguyen; Jaehoon Kang; Kyuhong Shim

arXiv:2604.08558 · cs.CL · April 13, 2026

TL;DR

WAND adapts pretrained autoregressive TTS models to windowed attention and uses knowledge distillation from a full-attention teacher to preserve synthesis quality while reducing memory and compute costs.

Contribution

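The paper proposes an attention scheme (persistent global attention over conditioning tokens plus sliding-window attention over generated tokens) and a training strategy (curriculum learning and knowledge distillation) that let pretrained AR-TTS models run with constant computational and memory complexity while preserving quality.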

Findings

01 Achieves up to 66.2% KV cache memory reduction.

02 Maintains high-fidelity speech synthesis.

03 Provides near-constant per-step latency.
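
To make the KV cache finding concrete, the sketch below counts cached positions per layer under full attention and under a windowed scheme that keeps every conditioning token plus a fixed number of recent generated tokens. The function names, the `window` parameter, and the example sizes are illustrative assumptions, not values from the paper; the reported 66.2% figure depends on the models and sequence lengths the authors evaluated.

```python
from typing import Optional


def kv_cache_entries(num_cond: int, num_generated: int,
                     window: Optional[int] = None) -> int:
    """Cached key/value positions per layer at one decoding step.

    window=None models full attention: every past token stays cached.
    Otherwise the cache holds all conditioning tokens plus at most
    `window` of the most recent generated tokens (an assumed layout).
    """
    if window is None:
        return num_cond + num_generated
    return num_cond + min(num_generated, window)


def cache_reduction(num_cond: int, num_generated: int, window: int) -> float:
    """Fraction of cache entries saved relative to full attention."""
    full = kv_cache_entries(num_cond, num_generated)
    windowed = kv_cache_entries(num_cond, num_generated, window)
    return 1.0 - windowed / full


# Hypothetical sizes chosen only for illustration.
full_size = kv_cache_entries(100, 900)
windowed_size = kv_cache_entries(100, 900, window=200)
saving = cache_reduction(100, 900, window=200)
```

Under these assumptions the windowed cache stops growing once the number of generated tokens exceeds `window`, which is why per-step cost can stay near constant for long outputs.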

Abstract

Recent decoder-only autoregressive text-to-speech (AR-TTS) models produce high-fidelity speech, but their memory and compute costs scale quadratically with sequence length due to full self-attention. In this paper, we propose WAND, Windowed Attention and Knowledge Distillation, a framework that adapts pretrained AR-TTS models to operate with constant computational and memory complexity. WAND separates the attention mechanism into two: persistent global attention over conditioning tokens and local sliding-window attention over generated tokens. To stabilize fine-tuning, we employ a curriculum learning strategy that progressively tightens the attention window. We further utilize knowledge distillation from a full-attention teacher to recover high-fidelity synthesis quality with high data efficiency. Evaluated on three modern AR-TTS models, WAND preserves the original quality while…
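
The split between global and local attention described in the abstract can be pictured as a boolean attention mask. The following is a minimal sketch under stated assumptions: conditioning tokens come first in the sequence, attention is causal, and a window of size `window` includes the current token. The function name and parameters are hypothetical, and the paper's actual implementation may differ.

```python
from typing import List


def windowed_global_mask(num_cond: int, num_generated: int,
                         window: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Positions [0, num_cond) are conditioning tokens (assumed to come first);
    positions [num_cond, num_cond + num_generated) are generated tokens.
    """
    total = num_cond + num_generated
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):  # causal: keys at or before the query only
            if j < num_cond:
                # Persistent global attention over conditioning tokens.
                mask[i][j] = True
            elif i - j < window:
                # Local sliding window over recently generated tokens.
                mask[i][j] = True
    return mask


# Hypothetical sizes: 2 conditioning tokens, 4 generated tokens, window of 2.
mask = windowed_global_mask(num_cond=2, num_generated=4, window=2)
last_row = mask[-1]
```

With these sizes the last query attends to both conditioning tokens and to the two most recent generated tokens only, so no row has more than `num_cond + window` active entries however long the output grows.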
