SRA: Span Representation Alignment for Large Language Model Distillation

Quoc Phong Dao; Hoang Son Nguyen; Pham Khanh Chi; Tung Nguyen; Linh Ngo Van; Nguyen Thi Ngoc Diep; Trung Le

arXiv:2605.01205 · cs.CL · May 5, 2026

TL;DR

This paper introduces SRA, a span representation alignment framework for large language model distillation. It improves cross-architecture knowledge transfer by aligning robust, tokenizer-agnostic spans, each modeled as a multi-particle system.

Contribution

SRA shifts the fundamental unit of alignment from individual tokens to semantically rich spans, and uses a multi-particle physical analogy together with geometric regularization to make distillation more effective.

Findings

01. SRA outperforms state-of-the-art cross-tokenizer knowledge distillation (CTKD) methods in cross-architecture distillation.

02. Modeling spans as multi-particle systems improves semantic robustness.

03. Attention-weighted span centers of mass enhance knowledge transfer.

Abstract

Cross-Tokenizer Knowledge Distillation (CTKD) enables knowledge transfer between a large language model and a smaller student, even when they employ different tokenizers. While existing approaches mainly focus on token-level alignment strategies, which are often brittle and sensitive to discrepancies between tokenizers, we argue that the method of aggregating tokens into more robust representations before distillation is of equal importance. In this paper, we introduce SRA (Span Representation Alignment for Large Language Model Distillation), a novel framework that reframes CTKD through the physical lens of Multi-Particle Dynamical Systems. SRA shifts the fundamental unit of alignment from tokens to robust, tokenizer-agnostic spans. We model each span as a cluster of particles and represent its state by its Center of Mass (CoM) - an attention-weighted…
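
The center-of-mass idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `span_center_of_mass`, the choice to normalize the attention scores so they sum to one, and the toy vectors are all assumptions made for illustration. It only shows what an attention-weighted mean over the token vectors of one span could look like.

```python
from typing import List, Sequence


def span_center_of_mass(
    token_vectors: Sequence[Sequence[float]],
    attention_scores: Sequence[float],
) -> List[float]:
    """Attention-weighted mean of the token vectors in one span.

    Hypothetical helper: each token is treated as a particle whose
    "mass" is its attention score, normalized over the span (an
    assumption; the source does not specify the normalization).
    """
    if len(token_vectors) == 0 or len(token_vectors) != len(attention_scores):
        raise ValueError("need exactly one attention score per token vector")
    total = sum(attention_scores)
    if total <= 0:
        raise ValueError("attention scores must sum to a positive value")

    # Normalize the scores so the masses sum to one.
    masses = [score / total for score in attention_scores]

    # Weighted average, computed one hidden dimension at a time.
    dim = len(token_vectors[0])
    return [
        sum(mass * vec[d] for mass, vec in zip(masses, token_vectors))
        for d in range(dim)
    ]


# Toy span of three tokens with 2-dimensional hidden states.
span_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
span_attention = [0.5, 0.25, 0.25]
com = span_center_of_mass(span_vectors, span_attention)
```

Because the result has the model's hidden dimension regardless of how many tokens the span contains, a teacher and a student that split the same text span into different numbers of tokens each yield one vector per span, which is what makes a span-level comparison possible in principle.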
