PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

Adhiraj Banerjee; Vipul Arora

arXiv:2605.06582 · cs.LG · May 8, 2026

TL;DR

PairAlign is a sequence-level self-alignment framework for audio tokenization that improves compactness, cross-view consistency, and edit-distance preservation over existing tokenizers.

Contribution

The paper presents a scalable, sequence-level self-alignment approach that moves audio tokenization beyond local quantization, enabling better sequence consistency and edit-distance preservation.

Findings

01 Reduces archive token count by 55% on TIMIT retrieval.

02 Learns compact, broad-vocabulary sequences with strong cross-view consistency.

03 Maintains edit-distance search capabilities with improved token efficiency.
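
To make "edit-distance search" concrete, here is a minimal sketch of Levenshtein distance over token-ID sequences, plus a brute-force nearest-neighbour lookup. It illustrates the general technique only. The function names `edit_distance` and `nearest`, the toy archive, and the unit costs are assumptions for illustration and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Levenshtein distance between two token-ID sequences (unit costs assumed)."""
    # prev[j] is the distance between the tokens of `a` seen so far
    # (before the current one) and the first j tokens of `b`.
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]  # distance from the first i tokens of `a` to the empty prefix of `b`
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def nearest(query: Sequence[int], archive: List[Sequence[int]]) -> Tuple[int, int]:
    """Return (index, distance) of the archive entry closest to `query`."""
    distances = [edit_distance(query, item) for item in archive]
    best = min(range(len(archive)), key=distances.__getitem__)
    return best, distances[best]


# Hypothetical toy archive of token sequences.
archive = [[5, 9, 2, 7], [5, 9, 7], [1, 1, 1]]
query = [5, 9, 3, 7]
best_index, best_distance = nearest(query, archive)
```

The inner loop runs once per pair of tokens, so the cost of each comparison grows with the product of the two sequence lengths. This is one reason shorter token sequences make this kind of search cheaper.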

Abstract

Many operations on sensory data -- comparison, memory, retrieval, and reasoning -- are naturally expressed over discrete symbolic structures. In language this interface is given by tokens; in audio, it must be learned. Existing audio tokenizers rely on quantization, clustering, or codec reconstruction, assigning tokens locally, so sequence consistency, compactness, length control, termination, and edit similarity are rarely optimized directly. We introduce PairAlign, a framework for compact audio tokenization through sequence-level self-alignment. PairAlign treats tokenization as conditional sequence generation: an encoder maps speech to a continuous condition, and an autoregressive decoder generates tokens from BOS, learning token identity, order, length, and EOS placement. Given two content-preserving views, each view's sequence is trained to be likely under the other's…
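
The generation step described in the abstract, in which a decoder starts from BOS and decides for itself when to emit EOS, can be sketched as a simple decoding loop. This is a toy illustration under stated assumptions, not the authors' implementation. Greedy decoding, the special-token IDs, the `max_len` cap, and the stand-in scoring function `toy_step` are all hypothetical, and a real system would use a learned encoder and decoder in their place.

```python
from typing import Callable, List, Sequence

BOS, EOS = 0, 1  # hypothetical special-token IDs, not taken from the paper
VOCAB_SIZE = 8   # hypothetical toy vocabulary size


def greedy_tokenize(condition: Sequence[int],
                    step_fn: Callable[[Sequence[int], List[int]], List[float]],
                    max_len: int = 32) -> List[int]:
    """Generate tokens from BOS until EOS is chosen or max_len tokens are emitted."""
    tokens = [BOS]
    for _ in range(max_len):
        scores = step_fn(condition, tokens)  # one score per vocabulary entry
        next_token = max(range(len(scores)), key=scores.__getitem__)
        if next_token == EOS:
            break  # sequence length is set by where EOS is chosen
        tokens.append(next_token)
    return tokens[1:]  # drop BOS


def toy_step(condition: Sequence[int], prefix: List[int]) -> List[float]:
    """Stand-in for a learned decoder: replays `condition`, then scores EOS highest."""
    position = len(prefix) - 1  # number of tokens already emitted after BOS
    target = condition[position] if position < len(condition) else EOS
    scores = [0.0] * VOCAB_SIZE
    scores[target] = 1.0
    return scores


tokens = greedy_tokenize([4, 2, 3], toy_step)
```

Because termination is part of generation, the output length is not tied to the number of input frames. The `max_len` cap is only a safety bound for the sketch.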
