Segmental Attention Decoding With Long Form Acoustic Encodings

Pawel Swietojanski; Xinwei Li; Mingbin Xu; Takaaki Hori; Dogan Can; Xiaodan Zhuang

arXiv:2512.14652·eess.AS·December 17, 2025

Open Access

TL;DR

This paper proposes four modifications to attention-based encoder-decoder models to improve their ability to decode long-form acoustic signals, addressing issues with position encoding and segmentation.

Contribution

The paper introduces novel techniques including explicit positional encodings, long-form training, segment concatenation, and semantic segmentation to enhance long-form acoustic decoding.

Findings

01 Modified models close the accuracy gap between continuous and segmented encodings.

02 Explicit positional encodings improve long-form decoding performance.

03 Training with extended context enables better generalization to long segments.

Abstract

We address the fundamental incompatibility of attention-based encoder-decoder (AED) models with long-form acoustic encodings. AED models trained on segmented utterances learn to encode absolute frame positions by exploiting limited acoustic context beyond segment boundaries, but fail to generalize when decoding long-form segments where these cues vanish. The model loses ability to order acoustic encodings due to permutation invariance of keys and values in cross-attention. We propose four modifications: (1) injecting explicit absolute positional encodings into cross-attention for each decoded segment, (2) long-form training with extended acoustic context to eliminate implicit absolute position encoding, (3) segment concatenation to cover diverse segmentations needed during training, and (4) semantic segmentation to align AED-decoded segments with training segments. We show these…
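
The permutation-invariance point in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a toy single-head cross-attention in plain Python, and the helper names (`cross_attention`, `sinusoidal_position`, `add_positions`), the sinusoidal form of the encoding, and the choice to add positions to the keys only are all assumptions made for illustration. It shows that, without positional information, reordering the acoustic encodings leaves the attention output unchanged, while adding absolute positions counted from the start of each decoded segment makes the output depend on frame order.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one decoder query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def sinusoidal_position(pos, dim):
    """Sinusoidal absolute positional encoding for one frame index (assumed form)."""
    enc = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def add_positions(frames):
    """Add positions that restart at 0 for the segment being decoded."""
    dim = len(frames[0])
    return [
        [x + p for x, p in zip(frame, sinusoidal_position(t, dim))]
        for t, frame in enumerate(frames)
    ]


def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


rng = random.Random(0)
dim, num_frames = 8, 6
frames = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_frames)]
query = [rng.gauss(0, 1) for _ in range(dim)]
shuffled = list(reversed(frames))  # same encodings, different temporal order

# Without positions: keys and values are permuted together, output is unchanged.
diff_plain = max_abs_diff(
    cross_attention(query, frames, frames),
    cross_attention(query, shuffled, shuffled),
)

# With explicit positions on the keys: frame order now changes the output.
diff_pos = max_abs_diff(
    cross_attention(query, add_positions(frames), frames),
    cross_attention(query, add_positions(shuffled), shuffled),
)

print("difference without positions:", diff_plain)
print("difference with positions:   ", diff_pos)
```

The first difference is zero up to floating-point rounding, and the second is not, which is the ordering signal that the abstract says is lost when implicit position cues vanish in long-form encodings.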

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Multimodal Machine Learning Applications