xGen-small Technical Report

Erik Nijkamp; Bo Pang; Egor Pakhomov; Akash Gokul; Jin Qu; Silvio Savarese; Yingbo Zhou; Caiming Xiong

arXiv:2505.06496 · cs.CL · May 13, 2025



TL;DR

xGen-small is a family of 4B and 9B Transformer decoder models built for long-context tasks. It combines domain-balanced data curation, multi-stage pre-training with length extension to 128k tokens, and targeted post-training to achieve strong performance on math, coding, and long-context benchmarks.

Contribution

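The report introduces a vertically integrated pipeline for training long-context Transformer models. The pipeline covers domain-balanced, frequency-aware data curation; multi-stage pre-training with quality annealing and length extension; and post-training via supervised fine-tuning, preference learning, and online reinforcement learning.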

Findings

01 Strong performance in math and coding tasks

02 Excels at long-context benchmarks

03 Effective long-context modeling up to 128k tokens

Abstract

We introduce xGen-small, a family of 4B and 9B Transformer decoder models optimized for long-context applications. Our vertically integrated pipeline unites domain-balanced, frequency-aware data curation; multi-stage pre-training with quality annealing and length extension to 128k tokens; and targeted post-training via supervised fine-tuning, preference learning, and online reinforcement learning. xGen-small delivers strong performance across various tasks, especially in math and coding domains, while excelling at long context benchmarks.


Taxonomy

Topics: Advanced Neural Network Applications · Wireless Signal Modulation Classification · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Softmax · Absolute Position Encodings