Cartridges: Lightweight and general-purpose long context representations via self-study
Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, Neel Guha, Dylan Zinsley, Emily Liu, Will Tennien, Atri Rudra, James Zou, Azalia Mirhoseini, Christopher Re

TL;DR
This paper introduces Cartridges, a method for creating lightweight, reusable long-context representations that mimic in-context learning, significantly reducing memory usage and enabling efficient long-text processing.
Contribution
The paper proposes self-study training for Cartridges, enabling them to replicate ICL performance with much lower memory and computational costs, and allowing composition at inference.
Findings
Cartridges trained with self-study match ICL performance on long-context benchmarks.
Cartridges trained with self-study use 38.6x less memory and deliver 26.4x higher throughput than ICL.
Cartridges extend effective context length and can be composed without retraining.
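To make the composition finding concrete, here is a minimal sketch of what concatenating two trained caches could look like. It is not the authors' implementation: the `KVCache` layout, the `compose` and `toy_cache` helpers, and the toy dimensions are all hypothetical, and a real system would hold tensors and handle positional information, which this sketch ignores.

```python
from typing import List, Tuple

Vector = List[float]
# One layer of a cache: (keys, values), each with one vector per cached token.
LayerCache = Tuple[List[Vector], List[Vector]]
KVCache = List[LayerCache]


def compose(a: KVCache, b: KVCache) -> KVCache:
    """Concatenate two caches along the token axis, layer by layer."""
    if len(a) != len(b):
        raise ValueError("caches must have the same number of layers")
    return [(ka + kb, va + vb) for (ka, va), (kb, vb) in zip(a, b)]


def toy_cache(num_layers: int, num_tokens: int, fill: float) -> KVCache:
    """Build a placeholder cache (hypothetical helper, 4-dim vectors)."""
    def block() -> List[Vector]:
        return [[fill] * 4 for _ in range(num_tokens)]
    return [(block(), block()) for _ in range(num_layers)]


# Two caches, imagined as trained on different corpora, with 3 and 5 slots.
corpus_a = toy_cache(num_layers=2, num_tokens=3, fill=0.1)
corpus_b = toy_cache(num_layers=2, num_tokens=5, fill=0.2)
composed = compose(corpus_a, corpus_b)
print(len(composed), len(composed[0][0]))  # layers, cached tokens per layer
```

The point of the sketch is only that composition needs no further training step: the composed cache is the two originals placed side by side, so its size is the sum of their sizes.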
Abstract
Large language models are often used to answer queries grounded in large text corpora (e.g. codebases, legal documents, or chat histories) by placing the entire corpus in the context window and leveraging in-context learning (ICL). Although current models support contexts of 100K-1M tokens, this setup is costly to serve because the memory consumption of the KV cache scales with input length. We explore an alternative: training a smaller KV cache offline on each corpus. At inference time, we load this trained KV cache, which we call a Cartridge, and decode a response. Critically, the cost of training a Cartridge can be amortized across all the queries referencing the same corpus. However, we find that the naive approach of training the Cartridge with next-token prediction on the corpus is not competitive with ICL. Instead, we propose self-study, a training recipe in which we generate…
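The claim that KV cache memory scales with input length can be checked with a short back-of-the-envelope calculation. The sketch below uses the standard per-token cost of a transformer KV cache (one key and one value vector per layer and per KV head). The model shape, the 2,048-token cache size, and the function name `kv_cache_bytes` are assumptions chosen for illustration, not values from the paper, so the resulting ratio is not the paper's reported figure.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """Bytes needed to store keys and values for `num_tokens` tokens."""
    # Factor of 2: one key vector and one value vector
    # per token, per layer, per KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


# Hypothetical model shape (assumption, not taken from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

# Full corpus placed in context vs. a much smaller trained cache.
full_context = kv_cache_bytes(128_000, LAYERS, KV_HEADS, HEAD_DIM)
small_cache = kv_cache_bytes(2_048, LAYERS, KV_HEADS, HEAD_DIM)

print(f"128,000-token context: {full_context / 2**30:.2f} GiB")
print(f"2,048-token cache:     {small_cache / 2**30:.2f} GiB")
print(f"memory ratio:          {full_context / small_cache:.1f}x")
```

Because the cost is linear in the number of cached tokens, the saving depends only on the ratio of token counts, not on the model shape.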
Peer Reviews
Decision·ICLR 2026 Poster
1. The self-study technique leverages synthetic data generation and distillation objectives to further improve context compression, enabling high-quality representation learning beyond naive next-token prediction. 2. The experimental results show that Cartridges significantly reduce memory consumption and increase throughput by over 26x compared to traditional ICL methods. Despite these resource savings, Cartridges achieve performance comparable to full-context ICL on challenging benchmarks. 3. The w
1. Effectiveness relies heavily on generating high-quality synthetic conversations. The paper should have included some analysis of the quality and cost of the synthetic data. 2. It would be better to compare with naive methods built on a similar idea to Cartridges, e.g., existing prompt compression methods + SFT with your synthetic data.
1. The paper proposes a new method called self-study, which learns to store document knowledge in learnable KV cache parameters. The method shifts compute from inference to training, which is useful in many real-world applications. 2. The authors carefully studied the effect of different initialization strategies, self-study compute, diversity of seed data augmentation prompts, etc., showing solid investigation of the factors that can impact performance. 3. The work also stu
1. The figures use a very small font and are very hard to read. 2. The paper does not compare with existing baselines on memory layers and active reading, nor does it discuss these works in the related work section, e.g., [1][2]. 3. One strong baseline is missing: using a summarizer to condense the document in text space before feeding the long document into the context. [1] https://arxiv.org/abs/2412.09764 [2] https://arxiv.org/abs/2508.09494
- The paper makes a clear, practical contribution. The proposed method has a high potential. Using self-study with chunking, the approach handles corpora beyond the model’s window. Cartridges can be concatenated at inference time. They are easy to serve with the existing infrastructure. - The paper offers useful design ablations, including parameterization and initialization strategies.
The reproducibility of this paper is a potential weak point. The authors do not provide code to reproduce their results, which may significantly diminish confidence in the work and its potential impact. The empirical results lack evaluations on real-world heterogeneous benchmarks, such as those with code or multimodal documents. Operational procedures for versioning and updating Cartridges are not discussed.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
