Effect of Document Packing on the Latent Multi-Hop Reasoning Capabilities of Large Language Models
Gabriele Prato, Shagun Sodhani, Alessandro Sordoni, Sarath Chandar

TL;DR
This paper investigates how document packing during training affects the latent multi-hop reasoning abilities of large language models, finding that packing can improve performance over training on individual documents at the cost of more compute.
Contribution
A systematic study of how document-packing strategies affect the latent multi-hop reasoning capabilities of large language models, a question the authors describe as largely unexplored.
Findings
Packing can improve multi-hop reasoning performance compared to training on individual documents.
Packing increases computational requirements.
Ablation studies identify key factors behind packing benefits.
Abstract
The standard practice for training large language models involves packing multiple documents together to optimize computational efficiency. However, the impact of this process on the models' capabilities remains largely unexplored. To address this gap, we investigate how different document-packing strategies influence the latent multi-hop reasoning abilities of LLMs. Our findings indicate that packing can improve model performance compared to training on individual documents, at the expense of more compute. To further understand the underlying mechanisms, we conduct an ablation study, identifying key factors that explain the advantages of packing. Ultimately, our research deepens the understanding of LLM training dynamics and provides practical insights for optimizing model development.
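To make the notion of packing concrete, the following is a minimal sketch, not the authors' implementation. It assumes documents are already tokenized into lists of integer IDs; the function names `pack_documents` and `attention_mask`, the `docs_per_pack` parameter, and the `cross_document` flag are hypothetical and chosen only for illustration. The sketch groups documents into packs and builds a causal attention mask that either allows or blocks attention across document boundaries.

```python
import random
from typing import List


def pack_documents(
    docs: List[List[int]], docs_per_pack: int, seed: int
) -> List[List[List[int]]]:
    """Shuffle documents and group them into packs of `docs_per_pack`.

    Hypothetical helper: a different `seed` yields a different grouping.
    """
    order = list(range(len(docs)))
    random.Random(seed).shuffle(order)
    return [
        [docs[i] for i in order[start:start + docs_per_pack]]
        for start in range(0, len(order), docs_per_pack)
    ]


def attention_mask(pack: List[List[int]], cross_document: bool) -> List[List[bool]]:
    """Causal mask over the concatenated pack.

    mask[q][k] is True when query position q may attend to key position k.
    With cross_document=False, attention is restricted to the same document.
    """
    # Document index of every token position in the concatenated sequence.
    doc_ids = [d for d, doc in enumerate(pack) for _ in doc]
    n = len(doc_ids)
    return [
        [k <= q and (cross_document or doc_ids[q] == doc_ids[k]) for k in range(n)]
        for q in range(n)
    ]


if __name__ == "__main__":
    docs = [[1, 2], [3], [4, 5, 6], [7]]
    packs = pack_documents(docs, docs_per_pack=2, seed=0)
    for pack in packs:
        print(pack, len(attention_mask(pack, cross_document=True)))
```

Under these assumptions, calling `pack_documents` with a different `seed` at each epoch would regroup the documents, while reusing one seed would keep the packed contexts fixed. This is only one way such a comparison could be set up.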
Peer Reviews
Decision: Submitted to ICLR 2026
- The paper is well-written, easy to follow, and well-motivated. Knowing how to pack documents to achieve the best knowledge retrieval from multiple documents at the same time is an important problem to solve.
- The experiments are thorough, and multiple ablations were carried out to justify the different choices (such as re-packing, or cross-attention). The paper shows that using packing during training leads to substantial increases in performance in multi-hop question answering.
The main weakness of this paper is that the proposed continual pre-training seems to be closer to fine-tuning than actual pre-training. By packing documents that are assigned to a specific question in the dataset, and then performing cross-attention between them, the model is learning that if it retrieves one of the correct documents, then the model knows from learning to output the next document, which one is relevant. This could be seen as a form of leaking test information to the dataset. The
The paper addresses a genuinely underexplored but important topic—how document packing affects the quality (not just efficiency) of LLM training. Given the ubiquity of packing in large-scale pre-training, this study is timely and of high practical relevance. The authors explore multiple factors (packing granularity, cross-document attention, repacking vs. fixed contexts, batch size effects) and cleanly isolate their roles. The diagonal comparison in Table 3 effectively demonstrates that packing
what is the evidence showing that cross-document attention yields richer contextual representations (e.g., attention map analysis, representational similarity, or probing). Without such analysis, the causal link between packing and improved reasoning remains speculative. Table 2 shows that packing increases compute, but there is no normalized comparison (e.g., accuracy per FLOP). Since the main selling point of packing is efficiency, the practical utility of adopting more expensive packing stra
- Clear empirical question with systematic ablations (packing vs. batch vs. repacking).
- Practical insight that repacking each epoch is crucial for performance.
- Honest discussion of compute–quality trade-offs.
- Consistent “sweet spot” trend (4–6 documents) across settings.
- The attention masking is ambiguous: §3.1 contains contradictory statements about whether cross-document attention is enabled, leaving the default mask unclear.
- The evaluation scope is conveniently narrow: experiments center on HotpotQA-easy and a 10% 2Wiki subsample with oracle per-question packing, which limits external validity.
- The metrics are biased: hallucination is computed only conditional on correct title selection, and an LLM-as-judge substitutes for standard EM/F1, skewing the
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
