TL;DR
Micro language models give ultra-resource-constrained devices instant responses: they generate the first few words locally and hand off to a cloud model, which completes the response.
Contribution
Introduces ultra-compact micro language models (8M-30M parameters) that generate the opening segment of a response on-device and hand off to a cloud model for completion, masking cloud latency on edge devices.
Findings
Micro models (8M-30M parameters) match several existing 70M-256M-class models in useful language generation.
Collaborative framework enables seamless mid-sentence handoffs.
Empirical results demonstrate effective on-device response initiation.
Abstract
Edge devices such as smartwatches and smart glasses cannot continuously run even the smallest 100M-1B parameter language models due to power and compute constraints, yet cloud inference introduces multi-second latencies that break the illusion of a responsive assistant. We introduce micro language models (LMs): ultra-compact models (8M-30M parameters) that instantly generate the first 4-8 words of a contextually grounded response on-device while a cloud model completes it, thus masking the cloud latency. We show that useful language generation survives at this extreme scale, with our models matching several existing 70M-256M-class models. We design a collaborative generation framework that reframes the cloud model as a continuator rather than a respondent, achieving seamless mid-sentence handoffs and structured graceful recovery via three error-correction methods when the local…
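The handoff described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: `micro_lm_prefix`, `cloud_continue`, and `respond` are hypothetical names, both models are replaced by canned strings, and the cloud delay is simulated with a short sleep. It only shows the control flow, in which the on-device prefix is emitted immediately while the cloud model, acting as a continuator, receives the prompt together with that prefix.

```python
import asyncio
from typing import Callable


def micro_lm_prefix(prompt: str, max_words: int = 8) -> str:
    """Hypothetical stand-in for the on-device micro LM (canned output)."""
    canned = "Sure, the weather today looks"
    return " ".join(canned.split()[:max_words])


async def cloud_continue(prompt: str, prefix: str, delay_s: float = 0.05) -> str:
    """Hypothetical stand-in for the cloud model, framed as a continuator.

    It receives the prefix already shown to the user and returns only the
    text that follows it. The sleep simulates network and inference latency.
    """
    await asyncio.sleep(delay_s)
    return "mostly sunny this afternoon."


async def respond(prompt: str, emit: Callable[[str], None]) -> str:
    # 1. Generate the opening words locally; no network round trip is needed.
    prefix = micro_lm_prefix(prompt)
    # 2. Schedule the cloud request, passing the prefix so the cloud model
    #    can continue mid-sentence instead of answering from scratch.
    task = asyncio.create_task(cloud_continue(prompt, prefix))
    # 3. Emit the prefix right away, without waiting for the cloud.
    emit(prefix)
    # 4. Emit the continuation once it arrives and return the stitched text.
    continuation = await task
    emit(continuation)
    return f"{prefix} {continuation}"


if __name__ == "__main__":
    print(asyncio.run(respond("What's the weather like?", print)))
```

In this sketch the user sees the prefix before the simulated cloud call returns. The recovery path for a poor local prefix is deliberately left out, since the abstract does not spell out the three error-correction methods.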
