Scaling Context, Not Parameters: Training a Compact 7B Language Model for Efficient Long-Context Processing

Chen Wu; Yin Song

arXiv:2505.08651·cs.CL·May 14, 2025

TL;DR

This paper introduces MegaBeam-Mistral-7B, a compact 7-billion-parameter language model that supports a 512K-token context window. It targets practical long-context applications and achieves competitive long-range reasoning without RAG or targeted fine-tuning.

Contribution

The work presents a 7B-parameter model that supports 512K-token contexts, shows competitive long-range reasoning and practical utility on long-context tasks, and is released as open source under the Apache 2.0 license.

Findings

01

Demonstrates superior in-context learning performance on HELMET

02

Shows robust retrieval and tracing on RULER

03

Achieves competitive long-range reasoning on BABILong

Abstract

We present MegaBeam-Mistral-7B, a language model that supports 512K-token context length. Our work addresses practical limitations in long-context training, supporting real-world tasks such as compliance monitoring and verification. Evaluated on three long-context benchmarks, our 7B-parameter model demonstrates superior in-context learning performance on HELMET and robust retrieval and tracing capability on RULER. It is currently the only open model to achieve competitive long-range reasoning on BABILong at 512K context length without RAG or targeted fine-tuning. Released as fully open source under the Apache 2.0 license, the model has been downloaded over 100,000 times on Hugging Face. Model available at: https://huggingface.co/aws-prototyping/MegaBeam-Mistral-7B-512k
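To give a sense of what a 512K-token context means in memory terms, the sketch below estimates the key-value cache size at inference time. It is a back-of-the-envelope illustration, not the authors' method. The architecture values (32 layers, 8 key-value heads, head dimension 128, 16-bit values) are assumptions taken from the standard Mistral-7B configuration and are not stated in the abstract; `kv_cache_bytes` is a hypothetical helper, and "512K" is assumed to mean 512 × 1024 tokens.

```python
# Assumed Mistral-7B-style architecture values (not stated in the abstract).
NUM_LAYERS = 32        # transformer blocks
NUM_KV_HEADS = 8       # key/value heads under grouped-query attention
HEAD_DIM = 128         # dimension of each attention head
BYTES_PER_VALUE = 2    # bf16 / fp16 storage


def kv_cache_bytes(num_tokens: int, batch_size: int = 1) -> int:
    """Bytes needed to cache keys and values for num_tokens positions."""
    # Factor 2 accounts for storing both a key and a value per head per layer.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens * batch_size


if __name__ == "__main__":
    context = 512 * 1024  # assumption: "512K" read as 524,288 tokens
    print(f"{kv_cache_bytes(context) / 2**30:.1f} GiB")
```

Under these assumptions the cache costs 128 KiB per token, so a full 512K-token sequence needs about 64 GiB for the cache alone, before counting model weights or activations.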

Code & Models

Models

🤗
aws-prototyping/MegaBeam-Mistral-7B-512k
model· 8.2k dl· ♡ 53

Videos

Scaling Context, Not Parameters: Training a Compact 7B Language Model for Efficient Long-Context Processing· underline

Taxonomy

Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Dropout · Layer Normalization · Byte Pair Encoding · Attention Dropout · Softmax · Residual Connection · WordPiece