Cross-Family Speculative Decoding for Polish Language Models on Apple Silicon: An Empirical Evaluation of Bielik 11B with UAG-Extended MLX-LM
Krzysztof Fonal

TL;DR
This paper evaluates cross-family speculative decoding for Polish language models on Apple Silicon. It introduces a Universal Assisted Generation (UAG) extension to MLX-LM to improve inference speed and analyzes the conditions that affect its effectiveness.
Contribution
It extends the MLX-LM framework with UAG for cross-tokenizer decoding on Apple Silicon and provides the first empirical evaluation for Polish LLMs in this context.
Findings
Context-aware token translation improves acceptance rates over naive translation.
The Polish-specialized Bielik 1.5B draft model has lower acceptance rates than the general-purpose draft models.
Throughput varies with content type, reaching a 1.7x speedup for structured text.
Abstract
Speculative decoding accelerates LLM inference by using a small draft model to propose k candidate tokens for a target model to verify. While effective for same-tokenizer pairs on high-bandwidth GPUs, its applicability to cross-family pairs with mismatched tokenizers and consumer-grade unified memory remains underexplored. We extend the MLX-LM framework with Universal Assisted Generation (UAG) to enable cross-tokenizer speculative decoding on Apple Silicon. We evaluate Bielik 11B-Instruct (Mistral-based) as the target model, paired with three draft models: Bielik 1.5B (Qwen-based with custom tokenizer), Qwen2.5-1.5B, and Llama 3.2-1B. Experiments on three Polish-language datasets (Wikipedia, pl_alpaca, synthetic) use draft lengths k in {2, 4, 6} to compare naive and context-aware token translation. Results show: (1) context-aware translation consistently improves acceptance rates across…
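The draft-then-verify loop described in the abstract can be illustrated with a minimal same-vocabulary sketch using greedy acceptance. This is not the paper's implementation and does not use the MLX-LM API: `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical names, the "models" are plain functions that map a token prefix to a next token, and the cross-tokenizer translation step that UAG adds is omitted. A real system verifies all k draft tokens in one batched forward pass of the target; the sketch calls the target once per position only for readability.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken,
                     k: int) -> Tuple[List[int], int]:
    """Run one draft-then-verify round.

    Returns (tokens emitted this round, number of draft tokens accepted).
    """
    # 1. The draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    # 2. The target checks each proposal in order and stops at the first mismatch.
    emitted: List[int] = []
    for tok in proposal:
        expected = target(prefix + emitted)
        if tok != expected:
            emitted.append(expected)  # keep the target's own token instead
            return emitted, len(emitted) - 1
        emitted.append(tok)

    # 3. All k proposals accepted: the target contributes one extra token.
    emitted.append(target(prefix + emitted))
    return emitted, k


def generate(prompt: List[int], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> Tuple[List[int], float]:
    """Generate max_new tokens; return (sequence, acceptance rate)."""
    out = list(prompt)
    proposed = accepted = 0
    while len(out) - len(prompt) < max_new:
        emitted, n_ok = speculative_step(out, draft, target, k)
        out.extend(emitted)
        proposed += k
        accepted += n_ok
    # Trim any overshoot from the final round.
    return out[: len(prompt) + max_new], accepted / proposed


# Toy models (assumptions for illustration): the target counts modulo 10,
# and the draft agrees with it except right after token 4.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rate = generate([0], toy_draft, toy_target, k=2, max_new=8)
    print(tokens, f"acceptance rate = {rate:.2f}")
```

With greedy acceptance the output is identical to what the target alone would produce; the draft only changes how many target-verified tokens are emitted per round. The acceptance rate computed here (accepted draft tokens divided by proposed draft tokens) is one common definition and may differ from the metric used in the paper.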
