When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models

Juan Gabriel Kostelec; Xiang Wang; Axel Laborieux; Christos Sourmpis; Qinghai Guo

arXiv:2603.26556·cs.CL·March 30, 2026

TL;DR

This paper introduces a hybrid architecture (Hybrid-KDA) and a multi-stage distillation pipeline (GenDistill), and argues that distilled models should be evaluated by autoregressive generation: log-likelihood scoring can make a student look nearly on par with its teacher when its generated answers are far worse.

Contribution

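The authors propose the Hybrid Kimi Delta Attention (Hybrid-KDA) architecture and the GenDistill multi-stage distillation pipeline, and show that generation-based evaluation is needed to judge how well a pretrained Transformer has been distilled into a hybrid sequence model.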

Findings

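01  Log-likelihood evaluation underestimates the teacher-student gap: a 7B distilled model that is within 0.2 pp of its teacher under log-likelihood scoring falls 20.8 pp behind when it must generate answers autoregressively.

02  Generation-based evaluation can reverse conclusions drawn from perplexity-only metrics.

03  Dataset choice and training strategy significantly affect generation quality.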

Abstract

Converting a pretrained Transformer into a more efficient hybrid model through distillation offers a promising approach to reducing inference costs. However, achieving high-quality generation in distilled models requires careful joint design of both the student architecture and the distillation process. Many prior distillation works evaluate downstream multiple-choice benchmarks by ranking candidate answers with log-likelihood rather than requiring autoregressive generation, which can obscure important differences in model quality. For example, we show that a 7B parameter distilled model that nearly matches its teacher to within 0.2 pp under log-likelihood scoring actually falls behind by 20.8 pp when the model must generate answers autoregressively. We propose a Hybrid Kimi Delta Attention (Hybrid-KDA) architecture paired with GenDistill, a multi-stage distillation pipeline, and…
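The difference between the two evaluation modes described in the abstract can be illustrated with a toy sketch. Everything below is hypothetical and is not taken from the paper: the bigram table `TOY_MODEL` stands in for a distilled student, and the helper names (`sequence_logprob`, `loglik_choice`, `greedy_generate`) are invented for illustration. The point is only that ranking a fixed set of candidates by log-likelihood can pick the right answer even when the model's own greedy output is something else entirely.

```python
import math

# Hypothetical toy "student": maps the previous token to next-token probabilities.
# A real model would condition on the full context; a bigram table keeps this runnable.
TOY_MODEL = {
    "Q:": {"A": 0.30, "B": 0.25, "the": 0.45},
    "A": {"<eos>": 1.0},
    "B": {"<eos>": 1.0},
    "the": {"<eos>": 1.0},
}


def next_token_probs(context):
    """Return the next-token distribution given the last token of the context."""
    return TOY_MODEL[context[-1]]


def sequence_logprob(prompt, continuation):
    """Sum of log p(token | context) over the continuation tokens."""
    context = list(prompt)
    total = 0.0
    for tok in continuation:
        p = next_token_probs(context).get(tok, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
        context.append(tok)
    return total


def loglik_choice(prompt, candidates):
    """Log-likelihood scoring: pick the candidate the model finds most likely."""
    return max(candidates, key=lambda c: sequence_logprob(prompt, c))


def greedy_generate(prompt, max_new_tokens=8):
    """Generation-based evaluation: let the model produce its own answer."""
    context = list(prompt)
    output = []
    for _ in range(max_new_tokens):
        probs = next_token_probs(context)
        tok = max(probs, key=probs.get)
        if tok == "<eos>":
            break
        output.append(tok)
        context.append(tok)
    return output


PROMPT = ["Q:"]
CANDIDATES = [["A"], ["B"]]
GOLD = ["A"]

ranked = loglik_choice(PROMPT, CANDIDATES)
generated = greedy_generate(PROMPT)

print("log-likelihood pick:", ranked, "correct:", ranked == GOLD)
print("greedy generation:  ", generated, "correct:", generated == GOLD)
```

In this toy case the candidate ranking is scored as correct because "A" is more likely than "B", while the greedy output is "the" and is scored as wrong. Log-likelihood scoring never considers continuations outside the candidate set, which is one way a gap like the one reported in the abstract can stay hidden.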
