Squid: Long Context as a New Modality for Energy-Efficient On-Device Language Models
Wei Chen, Zhiyuan Li, Shuo Xin, Yihao Wang

TL;DR
This paper introduces Dolphin, a novel architecture that treats long textual contexts as a separate modality, enabling energy-efficient and low-latency on-device language processing without sacrificing accuracy.
Contribution
Dolphin repurposes the image embedding projector used in vision-language models to encode long textual contexts, reducing energy consumption and latency in on-device language models.
Findings
10-fold improvement in energy efficiency
5-fold reduction in latency
Maintains response quality with extended contexts
Abstract
This paper presents Dolphin, a novel decoder-decoder architecture for energy-efficient processing of long contexts in language models. Our approach addresses the significant energy consumption and latency challenges inherent in on-device models. Dolphin employs a compact 0.5B parameter decoder to distill extensive contextual information into a memory embedding, substantially reducing the input length for the primary 7B parameter decoder model. Inspired by vision-language models, we repurpose the image embedding projector to encode long textual contexts, effectively treating extended context as a distinct modality. This innovative method enables processing of substantially longer contexts without the typical computational overhead associated with extended input sequences. Empirical evaluations demonstrate a 10-fold improvement in energy efficiency and a 5-fold reduction in latency…
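The data flow described in the abstract can be sketched at the level of tensor shapes. The following is a toy illustration under stated assumptions, not the authors' implementation: the function names, the tiny dimensions, the number of memory positions, and the choice to take the last few hidden states as the memory embedding are all hypothetical, and random numbers stand in for real model outputs.

```python
import random

def linear(rows, weight):
    """Apply a bias-free linear map to each row: rows (n x d_in) times weight (d_in x d_out)."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*weight)]
            for row in rows]

def compress_context(context_states, num_memory_tokens):
    """Stand-in for the small decoder (assumption): keep the hidden states
    at the last num_memory_tokens positions as the memory embedding."""
    return context_states[-num_memory_tokens:]

def build_main_decoder_input(context_states, query_embeddings, projector,
                             num_memory_tokens):
    """Project the memory embedding into the main decoder's embedding space
    and prepend it to the query embeddings."""
    memory = compress_context(context_states, num_memory_tokens)
    projected = linear(memory, projector)
    return projected + query_embeddings

# Hypothetical toy sizes; real hidden dimensions are far larger.
random.seed(0)
D_SMALL, D_MAIN = 4, 6
context_states = [[random.random() for _ in range(D_SMALL)] for _ in range(100)]
query_embeddings = [[random.random() for _ in range(D_MAIN)] for _ in range(5)]
projector = [[random.random() for _ in range(D_MAIN)] for _ in range(D_SMALL)]

main_input = build_main_decoder_input(
    context_states, query_embeddings, projector, num_memory_tokens=8
)
print(len(context_states) + len(query_embeddings), "->", len(main_input))
```

In this toy setup the main decoder receives 13 positions (8 memory plus 5 query) instead of the 105 it would see if the full context were passed through directly, which is the input-length reduction the abstract refers to.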
Taxonomy
Topics: Modular Robots and Swarm Intelligence · Context-Aware Activity Recognition Systems
