When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models
Juan Gabriel Kostelec, Xiang Wang, Axel Laborieux, Christos Sourmpis, Qinghai Guo

TL;DR
This paper introduces a hybrid model architecture and a distillation pipeline built around generation-based evaluation, and shows that log-likelihood metrics alone can give a misleading picture of model quality.
Contribution
The authors propose the Hybrid-KDA architecture and the GenDistill pipeline, and demonstrate that generation-based evaluation is important for effective distillation of sequence models.
Findings
Log-likelihood evaluation underestimates the teacher-student gap.
Generation-based evaluation can reverse conclusions from perplexity-only metrics.
Dataset choice and training strategies significantly impact generation quality.
Abstract
Converting a pretrained Transformer into a more efficient hybrid model through distillation offers a promising approach to reducing inference costs. However, achieving high-quality generation in distilled models requires careful joint design of both the student architecture and the distillation process. Many prior distillation works evaluate downstream multiple-choice benchmarks by ranking candidate answers with log-likelihood rather than requiring autoregressive generation, which can obscure important differences in model quality. For example, we show that a 7B-parameter distilled model that matches its teacher to within 0.2 percentage points (pp) under log-likelihood scoring falls behind by 20.8 pp when it must generate answers autoregressively. We propose a Hybrid Kimi Delta Attention (Hybrid-KDA) architecture paired with GenDistill, a multi-stage distillation pipeline, and…
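To make the distinction between the two evaluation modes concrete, here is a minimal sketch in Python. It is not the authors' evaluation code. The toy "model" (TOY_MODEL), the token names, and the helpers loglik_choice, greedy_generate, and extract_answer are all hypothetical, chosen only to show how ranking a fixed set of candidates by log-likelihood can pick the right answer while free-running greedy generation from the same model produces a wrong one.

```python
import math

# Hypothetical toy "model": maps a context (tuple of tokens) to a
# next-token probability distribution. Not taken from the paper.
TOY_MODEL = {
    ("Q",): {"The": 0.6, "A": 0.3, "B": 0.1},
    ("Q", "The"): {"B": 0.7, "A": 0.3},
    ("Q", "A"): {"<eos>": 1.0},
    ("Q", "B"): {"<eos>": 1.0},
    ("Q", "The", "A"): {"<eos>": 1.0},
    ("Q", "The", "B"): {"<eos>": 1.0},
}


def next_token_probs(context):
    """Look up the next-token distribution for a context."""
    return TOY_MODEL[tuple(context)]


def sequence_logprob(prompt, continuation):
    """Sum of token log-probabilities of `continuation` given `prompt`."""
    context = list(prompt)
    total = 0.0
    for tok in continuation:
        p = next_token_probs(context).get(tok, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
        context.append(tok)
    return total


def loglik_choice(prompt, candidates):
    """Log-likelihood scoring: only the given candidates compete."""
    return max(candidates, key=lambda c: sequence_logprob(prompt, c))


def greedy_generate(prompt, max_new_tokens=8):
    """Autoregressive greedy decoding: the model chooses every token itself."""
    context = list(prompt)
    generated = []
    for _ in range(max_new_tokens):
        probs = next_token_probs(context)
        tok = max(probs, key=probs.get)
        if tok == "<eos>":
            break
        generated.append(tok)
        context.append(tok)
    return generated


def extract_answer(generated, labels):
    """Return the first answer label that appears in the generated tokens."""
    for tok in generated:
        if tok in labels:
            return tok
    return None


prompt = ["Q"]
gold = "A"
candidates = [("A",), ("B",)]

ranked = loglik_choice(prompt, candidates)[0]
generated = greedy_generate(prompt)
parsed = extract_answer(generated, {"A", "B"})

print("log-likelihood pick:", ranked, "correct:", ranked == gold)
print("generated tokens:", generated, "parsed:", parsed, "correct:", parsed == gold)
```

In this toy setup, candidate ranking only compares "A" against "B" directly after the prompt, so the probability mass the model places on other continuations never affects the score. Greedy decoding follows that mass ("The") and ends up on the wrong label. The example illustrates only the mechanism; it says nothing about the size of the gap reported in the abstract.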
