Return of the Encoder: Maximizing Parameter Efficiency for SLMs
Mohamed Elfeki, Rui Liu, Chad Voegele

TL;DR
For small language models (1 billion parameters or fewer), this paper shows that encoder-decoder architectures are more efficient than decoder-only models, with lower first-token latency and higher throughput on edge devices, and introduces a knowledge distillation framework that lets them learn from larger decoder-only teachers.
Contribution
The paper provides a systematic analysis of encoder-decoder versus decoder-only models for small language models and introduces a knowledge distillation framework to improve encoder-decoder capabilities.
Findings
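Encoder-decoder models achieve 47% lower first-token latency than decoder-only models on edge devices.
Encoder-decoder models deliver 4.7x higher throughput than decoder-only models on edge devices.
Knowledge distillation from decoder-only teachers improves encoder-decoder performance by up to 6 points on average across diverse tasks.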
Abstract
The dominance of large decoder-only language models has overshadowed encoder-decoder architectures, despite their fundamental efficiency advantages in sequence processing. For small language models (SLMs) - those with 1 billion parameters or fewer - our systematic analysis across GPU, CPU, and NPU platforms reveals that encoder-decoder architectures achieve 47% lower first-token latency and 4.7x higher throughput compared to decoder-only models on edge devices. These gains may be attributed to encoder-decoder's one-time input processing and efficient separation of understanding and generation phases. We introduce a novel knowledge distillation framework that enables encoder-decoder models to leverage capabilities from large scalable decoder-only teachers while preserving their architectural advantages, achieving up to 6 average performance points improvement across diverse tasks, with…
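The abstract does not describe the distillation framework in detail, so the following is only a minimal sketch of generic soft-label knowledge distillation, not the authors' method. It assumes the student (an encoder-decoder model) and the teacher (a decoder-only model) produce logits over the same vocabulary for a given output position. The function names `log_softmax` and `distillation_loss` and the `temperature` parameter are illustrative assumptions.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Return log-probabilities for temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exponentiating for stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Hypothetical helper: the paper's actual objective may differ.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher (target)
    log_q = log_softmax(student_logits, temperature)  # student (prediction)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


# Example: one output position with a toy 4-token vocabulary.
teacher = [4.0, 1.0, 0.5, -2.0]
student = [2.5, 1.5, 0.0, -1.0]
loss = distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. In practice a term like this is usually combined with the ordinary cross-entropy loss on ground-truth tokens.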
Taxonomy
Topics: Modular Robots and Swarm Intelligence · Soft Robotics and Applications · Scheduling and Optimization Algorithms
Methods: Knowledge Distillation
