Doc-to-LoRA: Learning to Instantly Internalize Contexts
Rujikorn Charakorn, Edoardo Cetin, Shinnosuke Uesaka, Robert Tjarko Lange

TL;DR
Doc-to-LoRA (D2L) is a hypernetwork that quickly generates adapters for large language models, enabling efficient long-context understanding and reasoning without reprocessing the entire input, thus reducing memory and latency.
Contribution
D2L introduces a meta-learning approach to produce adapters in a single forward pass, improving long-context processing and adaptation speed for LLMs.
Findings
Achieves near-perfect zero-shot accuracy on long-context tasks.
Reduces peak memory consumption and update latency.
Outperforms standard context distillation on real-world QA datasets.
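To give a rough sense of the memory finding, the sketch below compares the size of a KV cache for a long context with the size of a LoRA adapter. It uses the standard KV-cache size formula (keys plus values, per layer, per KV head, per token). The model configuration, the LoRA rank, and the helper names `kv_cache_bytes` and `lora_bytes` are hypothetical and chosen only for illustration; they are not the configuration or the measurements reported in the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to cache keys and values for `seq_len` tokens."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


def lora_bytes(num_layers, d_in, d_out, rank,
               targets_per_layer=1, bytes_per_value=2):
    """Bytes needed to store LoRA factors A (rank x d_in) and B (d_out x rank)."""
    return (num_layers * targets_per_layer
            * rank * (d_in + d_out) * bytes_per_value)


# Hypothetical 32-layer model with 8 KV heads of dimension 128, 16-bit values,
# and a 32,768-token context. These numbers are assumptions, not the paper's.
kv = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768)

# Hypothetical rank-8 adapter on one 4096 x 4096 projection per layer.
lora = lora_bytes(num_layers=32, d_in=4096, d_out=4096, rank=8)

print(f"KV cache: {kv / 2**30:.1f} GiB, LoRA adapter: {lora / 2**20:.1f} MiB")
```

The KV cache grows linearly with context length, while the adapter size is independent of it. Answering a query still needs a KV cache for the query tokens themselves, so this is an upper-level comparison rather than a full memory model.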
Abstract
Long input sequences are central to in-context learning, document understanding, and multi-step reasoning of Large Language Models (LLMs). However, the quadratic attention cost of Transformers makes inference memory-intensive and slow. While context distillation (CD) can transfer information into model parameters, per-prompt distillation is impractical due to training costs and latency. To address these limitations, we propose Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to perform approximate CD within a single forward pass. Given an unseen prompt, D2L generates a LoRA adapter for a target LLM, enabling subsequent queries to be answered without re-consuming the original context, reducing latency and KV-cache memory consumption during inference of the target LLM. On a long-context needle-in-a-haystack task, D2L successfully learns to map contexts into adapters that…
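The core idea described in the abstract, a hypernetwork that maps a context to LoRA weights in one forward pass, can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration: the mean-pooling encoder, the linear hypernetwork heads `H_A` and `H_B`, the dimensions, and the single adapted layer. It is not the authors' architecture, and the hypernetwork is left untrained (the meta-learning step is not shown).

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 16  # hidden size of the toy target layer (hypothetical)
RANK = 2      # LoRA rank (hypothetical)
D_CTX = 8     # size of the pooled context embedding (hypothetical)

# Frozen base weight of one linear layer in the target model.
W_base = rng.normal(size=(D_MODEL, D_MODEL))

# Hypernetwork parameters: two linear heads that emit flattened LoRA factors.
# In a real system these would be meta-trained; here they are random.
H_A = rng.normal(scale=0.1, size=(D_CTX, RANK * D_MODEL))
H_B = rng.normal(scale=0.1, size=(D_CTX, D_MODEL * RANK))


def encode_context(token_embeddings):
    """Mean-pool token embeddings; a stand-in for a real context encoder."""
    return token_embeddings.mean(axis=0)


def generate_lora(ctx_embedding):
    """One forward pass of the hypernetwork: context embedding -> (A, B)."""
    A = (ctx_embedding @ H_A).reshape(RANK, D_MODEL)
    B = (ctx_embedding @ H_B).reshape(D_MODEL, RANK)
    return A, B


def adapted_forward(x, A, B, scale=1.0):
    """Apply the adapted layer: y = x W^T + scale * x A^T B^T."""
    return x @ W_base.T + scale * (x @ A.T) @ B.T


# The context is consumed once to produce the adapter...
context_tokens = rng.normal(size=(100, D_CTX))
A, B = generate_lora(encode_context(context_tokens))

# ...and later queries use only the adapter, not the original context.
query = rng.normal(size=(3, D_MODEL))
y = adapted_forward(query, A, B)
print(A.shape, B.shape, y.shape)
```

The adapted layer is equivalent to using the weight `W_base + B @ A`, a low-rank update of rank at most `RANK`, which is why the adapter stays small regardless of how long the original context was.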
Peer Reviews
Decision · Submitted to ICLR 2026
The project is well-motivated and clearly contextualized within the literature. It is interesting to explore whether a hypernetwork can replace the process of running gradient descent. The chosen architecture makes sense and is clearly explained. The paper chooses a comprehensive set of evaluation settings and benchmark tasks.
There are places where the experiments and writing can be clarified. Specific questions about the current paper contents:
- L076-079: The writing could be clarified. Why is chunking needed? Why is NIAH the right evaluation for success? Why is training limited to 256 tokens?
- L140-142: What type of data was used in the meta-training dataset? What was the process for generating the context, queries, and responses?
- L245-258: Why include samples from the QA datasets? Esp
- This paper addresses an important problem: how to reduce the memory consumption of in-context learning over large documents.
- The D2L method is simple and intuitive.
- It can extend the context length of the model 4x on the NIAH task. This result is very interesting!
- Compared with context distillation, it has significantly lower internalization cost.
- The paper is well written and easy to follow.
- The main claim of the paper is that D2L “outperforms CD with improved internalization efficiency.” This claim lacks nuance and is incautious. [Prior work](https://arxiv.org/abs/2506.06266) shows that the performance of context distillation improves steadily as the number of generated queries increases (up to hundreds of thousands of queries). However, the authors compare against CD with only 5 generated queries. Given this, the broad claim that D2L outperforms CD both in
* The idea is interesting. This work uses the D2L hypernetwork to process the chunks, and then applies the generated weights for each document to the LLM to produce the response.
* The paper is written clearly. Figure 1 presents the pipeline of the training process and the data construction. The method section describes how to create and use the activation.
* The experiments support the claim that Doc-to-LoRA can achieve strong performance with little additional cost.
* This work provides the pseudo-code for Doc-to-LoRA, mak
* Baselines may be missing. This work does not compare its performance with other text-to-weight methods.
* This work uses Gemma for the experiments. It is not clear whether the claims still hold for other models.
* It is not clear how the method performs on out-of-distribution datasets. For example, if you train on PwC and test on SQuAD, what is the performance?
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Text Readability and Simplification
