Cross-Family Speculative Prefill: Training-Free Long-Context Compression with Small Draft Models

Shubhangi Upasani; Ravi Shanker Raju; Bo Li; Mengmeng Ji; John Long; Chen Wu; Urmish Thakker; Guangtao Wang

arXiv:2603.02631 · cs.CL · March 16, 2026

Open Access

TL;DR

This paper demonstrates that cross-family speculative prefill effectively compresses prompts for large language models, reducing inference time while maintaining high performance, even when using draft models from different families.

Contribution

The paper evaluates a training-free prompt compression method that transfers attention-based token importance estimation across model families, extending speculative prefill to target models that lack a smaller in-family draft model.

Findings

01. Cross-family prompt compression retains 90-100% of baseline performance.

02. Cross-family speculative prefill significantly reduces time to first token (TTFT) at inference.

03. Attention-based token importance estimation transfers reliably across diverse models.

Abstract

Prompt length is a major bottleneck in agentic large language model (LLM) workloads, where repeated inference steps and multi-call loops incur substantial prefill cost. Recent work on speculative prefill demonstrates that attention-based token importance estimation can enable training-free prompt compression, but this assumes the existence of a draft model that shares the same tokenizer as the target model. In practice, however, agentic pipelines frequently employ models without any smaller in-family draft model. In this work, we study cross-family speculative prefill, where a lightweight draft model from one model family is used to perform prompt compression for a target model from a different family. Using the same speculative prefill mechanism as prior work, we evaluate a range of cross-family draft-target combinations, including Qwen, LLaMA, and DeepSeek models. Across a broad…
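The selection step described in the abstract can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: a draft model's attention weights score each context token, the top-scoring fraction is kept in original order, and the kept tokens are returned as character spans of the prompt text so that a target model with a different tokenizer can re-tokenize the compressed text. The function names (`token_importance`, `select_spans`, `compress_prompt`), the layout of `attn`, the `keep_ratio` parameter, and the use of character spans to bridge tokenizers are all hypothetical; the abstract does not say how the tokenizer mismatch is handled.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) character offsets into the prompt text


def token_importance(attn: Sequence[Sequence[float]]) -> List[float]:
    """Average the attention each context token receives across query rows.

    Assumed layout: attn[q][k] is the draft model's attention weight from
    query position q to context token k, already averaged over heads/layers.
    """
    n_queries = len(attn)
    n_tokens = len(attn[0])
    return [sum(row[k] for row in attn) / n_queries for k in range(n_tokens)]


def select_spans(scores: Sequence[float], offsets: Sequence[Span],
                 keep_ratio: float) -> List[Span]:
    """Keep the highest-scoring tokens and return them as merged character spans."""
    n_keep = max(1, int(len(scores) * keep_ratio))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n_keep]
    merged: List[Span] = []
    for i in sorted(top):  # restore original prompt order
        start, end = offsets[i]
        if merged and start <= merged[-1][1]:
            # Token touches the previous kept span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def compress_prompt(text: str, offsets: Sequence[Span],
                    attn: Sequence[Sequence[float]],
                    keep_ratio: float = 0.5) -> str:
    """Return compressed text; the target model re-tokenizes it with its own tokenizer."""
    spans = select_spans(token_importance(attn), offsets, keep_ratio)
    return " ".join(text[s:e].strip() for s, e in spans)


# Toy example: six draft-model tokens, two query positions.
text = "alpha beta gamma delta epsilon zeta"
offsets = [(0, 5), (6, 10), (11, 16), (17, 22), (23, 30), (31, 35)]
attn = [
    [0.05, 0.30, 0.05, 0.25, 0.05, 0.30],
    [0.10, 0.20, 0.10, 0.30, 0.10, 0.20],
]
print(compress_prompt(text, offsets, attn, keep_ratio=0.5))  # beta delta zeta
```

Working in character spans rather than token ids is one plausible way to keep the sketch independent of either model's vocabulary; in a real pipeline the offsets would come from the draft tokenizer and the attention weights from a forward pass of the draft model.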

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education