Fox-1: Open Small Language Model for Cloud and Edge

Zijian Hu; Jipeng Zhang; Rui Pan; Zhaozhuo Xu; Shanshan Han; Han Jin; Alay Dilipbhai Shah; Dimitris Stripelis; Yuhang Yao; Salman Avestimehr; Tong Zhang; Chaoyang He

arXiv:2411.05281·cs.CL·April 9, 2025

TL;DR

Fox-1 is a series of small, efficient language models trained with a novel data curriculum and an enhanced architecture. The models achieve competitive benchmark performance and are openly released for the open-source community.

Contribution

The paper presents Fox-1, a new small language model series trained with a 3-stage data curriculum and built on architectural choices that include a deeper layer structure, an expanded vocabulary, and Grouped Query Attention (GQA).

Findings

01

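Achieves better or on-par performance on various benchmarks compared to StableLM-2-1.6B, Gemma-2B, Qwen1.5-1.8B, and OpenELM-1.1B, with competitive inference speed and throughput.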

02

Introduces a novel 3-stage data curriculum for improved training efficiency.

03

Models are openly released under Apache 2.0 license.

Abstract

We present Fox-1, a series of small language models (SLMs) consisting of Fox-1-1.6B and Fox-1-1.6B-Instruct-v0.1. These models are pre-trained on 3 trillion tokens of web-scraped document data and fine-tuned with 5 billion tokens of instruction-following and multi-turn conversation data. To improve pre-training efficiency, the Fox-1-1.6B model introduces a novel 3-stage data curriculum across all the training data, with sequence lengths of 2K to 8K. In architecture design, Fox-1 features a deeper layer structure, an expanded vocabulary, and Grouped Query Attention (GQA), offering a performant and efficient architecture compared to other SLMs. Fox-1 achieves better or on-par performance in various benchmarks compared to StableLM-2-1.6B, Gemma-2B, Qwen1.5-1.8B, and OpenELM-1.1B, with competitive inference speed and throughput. The model weights have been released under the Apache…
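The abstract names Grouped Query Attention but does not spell out its mechanics. The sketch below is a minimal, generic illustration of GQA for a single decoding step, not Fox-1's actual implementation: several query heads share one key/value head, which shrinks the key/value cache by the ratio of query heads to KV heads. All function names, head counts, and dimensions here are hypothetical and chosen only for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention for one query vector over cached keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def grouped_query_attention(queries, keys, values):
    """GQA for one decoding step (illustrative layout, not Fox-1's code).

    queries: [n_q_heads][head_dim]
    keys, values: [n_kv_heads][seq_len][head_dim]
    Each block of n_q_heads // n_kv_heads query heads shares one KV head.
    """
    n_q_heads, n_kv_heads = len(queries), len(keys)
    assert n_q_heads % n_kv_heads == 0, "query heads must divide evenly into groups"
    group_size = n_q_heads // n_kv_heads
    outputs = []
    for h, q in enumerate(queries):
        kv = h // group_size  # index of the KV head shared by this query head
        outputs.append(attend(q, keys[kv], values[kv]))
    return outputs


def kv_cache_entries(n_layers, n_kv_heads, head_dim, seq_len):
    """Number of cached scalars: one key and one value tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len


# Hypothetical toy setup: 4 query heads sharing 2 KV heads, 3 cached tokens.
queries = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
keys = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]],
]
values = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 7.0], [7.0, 7.0], [7.0, 7.0]],
]
outputs = grouped_query_attention(queries, keys, values)
```

With multi-head attention every query head would carry its own key/value head, so under this layout the cache for 8 KV heads is four times the cache for 2 KV heads at the same depth, head dimension, and sequence length. That saving is the usual motivation for GQA in memory-constrained inference.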
