SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment

Yixin Song; Zhenliang Xue; Dongliang Wei; Feiyang Chen; Jianxiang Gao; Junchen Liu; Hangyu Liang; Guangshuo Qin; Chengrong Tian; Bo Wen; Longyu Zhao; Xinrui Zheng; Zeyu Mi; Haibo Chen

arXiv:2507.20984 · cs.LG · July 31, 2025

TL;DR

SmallThinker is a family of large language models designed for deployment on local devices. A deployment-aware architecture, paired with matching inference techniques, lets the models run efficiently on-device without high-end hardware.

Contribution

We designed a deployment-aware architecture with sparse structures and prefetching mechanisms, enabling large language models to run efficiently on resource-limited local devices.

Findings

01

Models achieve state-of-the-art performance scores.

02

Models run at over 20 tokens/sec on consumer CPUs.

03

Models need little memory: 1 GB and 8 GB, respectively.

Abstract

While frontier large language models (LLMs) continue to push capability boundaries, their deployment remains confined to GPU-powered cloud infrastructure. We challenge this paradigm with SmallThinker, a family of LLMs natively designed, not adapted, for the unique constraints of local devices: weak computational power, limited memory, and slow storage. Unlike traditional approaches that mainly compress existing models built for the cloud, we architect SmallThinker from the ground up to thrive within these limitations. Our innovation lies in a deployment-aware architecture that transforms constraints into design principles. First, we introduce a two-level sparse structure combining fine-grained Mixture-of-Experts (MoE) with sparse feed-forward networks, drastically reducing computational demands without sacrificing model capacity. Second, to conquer the I/O bottleneck of slow storage, we…
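The two-level sparsity named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the dimensions, the softmax top-k router, and the use of ReLU to produce sparse hidden activations are all assumptions chosen for illustration, and every function name below is hypothetical. Level 1 routes a token to a few experts; level 2 skips hidden units that the activation has zeroed inside each chosen expert.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(d_model, d_hidden, rng):
    """Hypothetical expert: up-projection and down-projection weights."""
    up = [[rng.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_hidden)]
    down = [[rng.gauss(0, 0.5) for _ in range(d_hidden)] for _ in range(d_model)]
    return up, down


def sparse_ffn(expert, x):
    """Level 2 (assumed ReLU sparsity): only hidden units with a nonzero
    activation contribute, so the down-projection skips the rest."""
    up, down = expert
    hidden = [max(0.0, h) for h in matvec(up, x)]
    active = [j for j, h in enumerate(hidden) if h > 0.0]
    out = [sum(row[j] * hidden[j] for j in active) for row in down]
    return out, len(active)


def moe_forward(x, router, experts, top_k):
    """Level 1: evaluate only the top_k experts picked by the router."""
    probs = softmax(matvec(router, x))
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    y = [0.0] * len(x)
    active_units = 0
    for i in chosen:
        out, n_active = sparse_ffn(experts[i], x)
        active_units += n_active
        weight = probs[i] / norm  # renormalise over the chosen experts
        y = [acc + weight * o for acc, o in zip(y, out)]
    return y, chosen, active_units


# Toy sizes, chosen only for illustration.
rng = random.Random(0)
D_MODEL, D_HIDDEN, N_EXPERTS, TOP_K = 8, 16, 8, 2
router = [[rng.gauss(0, 0.5) for _ in range(D_MODEL)] for _ in range(N_EXPERTS)]
experts = [make_expert(D_MODEL, D_HIDDEN, rng) for _ in range(N_EXPERTS)]
token = [rng.gauss(0, 1) for _ in range(D_MODEL)]

output, chosen, active_units = moe_forward(token, router, experts, TOP_K)
print(f"experts used: {len(chosen)} of {N_EXPERTS}")
print(f"hidden units computed in down-projection: {active_units} of {N_EXPERTS * D_HIDDEN}")
```

In this sketch all expert weights still exist, so capacity is unchanged, but each token touches only `TOP_K` experts and, within them, only the hidden units that survive the activation. That is the sense in which compute drops without shrinking the model.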
