BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model

Nolan Dey; Daria Soboleva; Faisal Al-Khateeb; Bowen Yang; Ribhu Pathria; Hemant Khachane; Shaheer Muhammad; Zhiming (Charles) Chen; Robert Myers; Jacob Robert Steeves; Natalia Vassilieva; Marvin Tom; Joel Hestness

arXiv:2309.11568 · cs.AI · September 22, 2023 · 2 cites

TL;DR

BTLM-3B-8K is an open-source 3 billion parameter language model that outperforms existing 3B parameter models by 2-5.5% on downstream tasks, handles contexts up to 8,192 tokens, and fits in 3GB of memory at 4-bit precision, which makes it practical on low-resource hardware.

Contribution

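The paper introduces BTLM-3B-8K, a 3B parameter language model whose training recipe (a cleaned and deduplicated SlimPajama dataset, tuned μP hyperparameters and schedule) and architecture choices (ALiBi position embeddings, SwiGLU nonlinearity) yield strong downstream accuracy and long-context performance up to 8,192 tokens.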

Findings

01 Outperforms existing 3B models by 2-5.5% on downstream tasks.

02 Competitive with some 7B parameter models.

03 Requires only 3GB memory with 4-bit precision, enabling deployment on edge devices.
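
As a rough check on the memory finding, the sketch below computes the storage needed for the weights alone at several precisions. It assumes a nominal count of 3 billion parameters (the exact count is not stated here) and the helper name `weight_memory_gb` is made up for illustration. At 4 bits the weights come to about 1.5 GB, so the 3GB figure presumably also covers runtime overhead such as activations and cached attention state; that breakdown is an assumption, not something the summary states.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: nominal "3B" parameter count; the real model may differ slightly.
NOMINAL_PARAMS = 3e9

if __name__ == "__main__":
    for bits in (16, 8, 4):
        gb = weight_memory_gb(NOMINAL_PARAMS, bits)
        print(f"{bits:>2}-bit weights: {gb:.1f} GB")
```

The same arithmetic shows why a 7B model is harder to place on edge devices: its weights alone take roughly 3.5 GB at 4 bits under the same assumptions.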

Abstract

We introduce the Bittensor Language Model, called "BTLM-3B-8K", a new state-of-the-art 3 billion parameter open-source language model. BTLM-3B-8K was trained on 627B tokens from the SlimPajama dataset with a mixture of 2,048 and 8,192 context lengths. BTLM-3B-8K outperforms all existing 3B parameter models by 2-5.5% across downstream tasks. BTLM-3B-8K is even competitive with some 7B parameter models. Additionally, BTLM-3B-8K provides excellent long context performance, outperforming MPT-7B-8K and XGen-7B-8K on tasks up to 8,192 context length. We trained the model on a cleaned and deduplicated SlimPajama dataset; aggressively tuned the μP hyperparameters and schedule; used ALiBi position embeddings; and adopted the SwiGLU nonlinearity. On Hugging Face, the most popular models have 7B parameters, indicating that users prefer the quality-size ratio of 7B models. Compacting the…
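
To illustrate the ALiBi position embeddings mentioned in the abstract, here is a minimal pure-Python sketch of the general technique: each attention head adds a penalty to its attention scores that grows linearly with the distance between query and key. The function names, the power-of-two head restriction, and folding the causal mask into the bias matrix are simplifications chosen for this example; it is not taken from the BTLM-3B-8K code.

```python
from typing import List


def alibi_slopes(num_heads: int) -> List[float]:
    """Per-head slopes: a geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8).

    Sketch only: handles the simple case where num_heads is a power of two.
    """
    if num_heads <= 0 or num_heads & (num_heads - 1) != 0:
        raise ValueError("this sketch assumes a power-of-two head count")
    ratio = 2.0 ** (-8.0 / num_heads)
    return [ratio ** (i + 1) for i in range(num_heads)]


def alibi_bias(slope: float, seq_len: int) -> List[List[float]]:
    """Bias added to attention scores for one head.

    Entry [i][j] is -slope * (i - j) for keys at or before the query (j <= i).
    Future positions get -inf here so the causal mask is visible in one matrix.
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    slopes = alibi_slopes(8)
    print(slopes[:3])
    for row in alibi_bias(slopes[0], 4):
        print(row)
```

Because the penalty depends only on relative distance and not on a learned table of positions, the same bias formula can be evaluated at sequence lengths longer than those seen in training, which is the property that makes it attractive for long-context models.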

Code & Models

Repositories

cerebras/modelzoo (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: SwiGLU · Attention with Linear Biases
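
For readers unfamiliar with SwiGLU, the following is a minimal pure-Python sketch of a SwiGLU feed-forward layer in its usual form: one projection is passed through the Swish function and used to gate a second projection element-wise, and the result is projected back down. The weight names (`w_gate`, `w_up`, `w_down`), the absence of bias terms, and the tiny dimensions are assumptions made for illustration, not details of the BTLM-3B-8K implementation.

```python
import math
from typing import List

Matrix = List[List[float]]


def swish(x: float) -> float:
    """Swish (SiLU) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def linear(x: List[float], weight: Matrix) -> List[float]:
    """Vector-matrix product; weight has shape (len(x), d_out)."""
    d_out = len(weight[0])
    return [sum(x[i] * weight[i][j] for i in range(len(x))) for j in range(d_out)]


def swiglu_ffn(x: List[float], w_gate: Matrix, w_up: Matrix, w_down: Matrix) -> List[float]:
    """SwiGLU feed-forward: (swish(x @ w_gate) * (x @ w_up)) @ w_down."""
    gate = [swish(v) for v in linear(x, w_gate)]
    up = linear(x, w_up)
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise gating
    return linear(hidden, w_down)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    sum_columns = [[1.0], [1.0]]
    print(swiglu_ffn([1.0, 2.0], identity, identity, sum_columns))
```

Compared with a plain two-matrix feed-forward layer, the gated form uses three weight matrices, so implementations typically shrink the hidden width to keep the parameter count comparable.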