Learning Facts at Scale with Active Reading

Jessy Lin; Vincent-Pierre Berges; Xilun Chen; Wen-Tau Yih; Gargi Ghosh; Barlas Oğuz

arXiv:2508.09494·cs.CL·August 14, 2025

TL;DR

This paper introduces Active Reading, a training framework that enhances large language models' ability to learn and recall facts reliably by using self-guided study strategies, leading to significant improvements in factual knowledge.

Contribution

The paper proposes Active Reading, a novel training approach that improves factual learning in large language models, demonstrated through expert domain training and large-scale pretraining.

Findings

01

Models with Active Reading outperform vanilla finetuning on knowledge absorption.

02

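On a Wikipedia-grounded subset of SimpleQA, Active Reading reaches 66% accuracy, a relative gain of over 300% (+313%) over vanilla finetuning; on FinanceBench it reaches 26% (+160%).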

03

Meta WikiExpert-8B surpasses larger models in factual question answering.
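
The relative gains quoted in the abstract can be unpacked with a little arithmetic. The sketch below assumes the usual definition, relative gain = (new - baseline) / baseline, and backs out the baselines implied by the reported figures. These baselines are derived here, not quoted from the paper, and are subject to rounding in the reported numbers. It also assumes the FinanceBench gain is measured against vanilla finetuning, which the truncated abstract does not state. Function names are illustrative.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


def implied_baseline(new: float, relative_gain_pct: float) -> float:
    """Invert relative_gain: the baseline that yields `new` at the given gain."""
    return new / (1.0 + relative_gain_pct / 100.0)


# Figures reported in the abstract: 66% (+313% relative) and 26% (+160%).
simpleqa_baseline = implied_baseline(66.0, 313.0)      # roughly 16%
financebench_baseline = implied_baseline(26.0, 160.0)  # roughly 10%

print(f"Implied SimpleQA-subset baseline: {simpleqa_baseline:.1f}%")
print(f"Implied FinanceBench baseline:    {financebench_baseline:.1f}%")
```

Read this way, "+313% relative" means the accuracy is a little over four times the baseline, not 313 percentage points higher.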

Abstract

LLMs are known to store vast amounts of knowledge in their parametric memory. However, learning and recalling facts from this memory is known to be unreliable, depending largely on the prevalence of particular facts in the training data and other factors which are poorly understood. Practitioners are lacking tools which will allow them to ensure that the models learn a given body of knowledge reliably and consistently. To this end, we propose Active Reading: a framework where we train models to study a given set of material with self-generated learning strategies. First, we demonstrate models trained with Active Reading on expert domains absorb significantly more knowledge than vanilla finetuning and other data augmentations. We train expert 8B models that achieve 66% on a Wikipedia-grounded subset of SimpleQA (+313% relative over vanilla finetuning) and 26% on FinanceBench (+160%…
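
As a rough illustration of what "studying material with self-generated learning strategies" could look like in code, here is a minimal sketch. It assumes a two-step loop: the model first proposes study strategies for a document, then applies each strategy to produce training text. The prompts, the function names, and the `generate` callable are all hypothetical stand-ins, not the authors' implementation; `toy_generate` replaces a real language model call so the example runs on its own.

```python
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt to generated text.
Generate = Callable[[str], str]


def propose_strategies(document: str, generate: Generate, n: int = 3) -> List[str]:
    """Ask the model for up to n study strategies, one per line (assumed prompt)."""
    prompt = (
        f"List {n} different strategies for studying the following document, "
        f"one per line.\n\n{document}"
    )
    lines = [line.strip(" -\t") for line in generate(prompt).splitlines()]
    return [line for line in lines if line][:n]


def apply_strategy(document: str, strategy: str, generate: Generate) -> str:
    """Ask the model to study the document using one strategy (assumed prompt)."""
    prompt = (
        f"Strategy: {strategy}\n\n"
        "Apply this strategy to the document below and write out the "
        f"resulting study material.\n\n{document}"
    )
    return generate(prompt).strip()


def active_reading_corpus(
    documents: List[str], generate: Generate, n: int = 3
) -> List[str]:
    """Build synthetic training text: one sample per (document, strategy) pair."""
    corpus = []
    for doc in documents:
        for strategy in propose_strategies(doc, generate, n):
            corpus.append(apply_strategy(doc, strategy, generate))
    return corpus


def toy_generate(prompt: str) -> str:
    """Stand-in for a language model so the sketch is runnable offline."""
    if prompt.startswith("List"):
        return "Paraphrase the key facts\nWrite question-answer pairs\nBuild a timeline"
    return "study notes: " + prompt.splitlines()[0]


corpus = active_reading_corpus(["Doc A", "Doc B"], toy_generate, n=2)
print(len(corpus), "synthetic samples")
```

In a real setup the resulting corpus would be used as finetuning or pretraining data; how the paper mixes, filters, or scales this data is not covered by this sketch.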

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The proposed method is straightforward and demonstrates significant gains compared to baseline methods, in particular for the 8B model.
2. The analysis is thorough and examines a variety of questions about their proposed method. Each analysis section provides valuable insights into why their method works and under what settings it does.

Weaknesses

1. The results with the 70B model have substantially smaller improvements when compared to the 8B -- improving accuracy on SimpleQA <2% versus ~60% according to Table 3. Given this dramatic difference, it would be helpful to see the full results from the primary settings and some of the analysis repeated with this model. Likewise, experimenting with another base model (non-llama) would also help substantiate these results.
2. Including more randomly selected samples of both task agnostic and ta

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The Active Reading method is intuitive, scalable, and presents a clever way to generate highly diverse synthetic data by leveraging the model's own capabilities.
2. The empirical results are extremely strong, particularly the performance of the 8B model on Simple WikiQA which nearly matches the gold context baseline.
3. The release of WikiExpert 8B is a significant contribution, as it achieves state of the art factual recall for its size class and provides a powerful, compact model for fact i

Weaknesses

1. While the method excels at information extraction, its performance on the full FinanceBench benchmark is notably weaker than the synthetic QA baseline. This suggests the generated strategies may not adequately cover complex reasoning, a point the paper acknowledges but does not fully resolve.
2. The finding in Table 3 that data generated by a 70B model leads to worse performance for an 8B model than its own self generated data is highly counterintuitive. This result is not deeply investigated

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The core idea of this work is both innovative and well-grounded. The proposed Active Reading framework, inspired by human learning behaviors, offers a conceptually sound and intuitively appealing approach to improving factual knowledge acquisition through self-generated data augmentation.
- The work is methodologically solid, with clear experimental design, comprehensive ablation studies, and detailed scaling analyses. The authors provide meaningful comparisons with strong baselines such as p

Weaknesses

- In line 240, the authors state that they add mixed pre-training data to prevent model degradation, but they do not provide experiments to demonstrate the occurrence of such degradation.
- The conclusion in lines 274–276 seems rather obvious. Since SimpleWiki can be viewed as representing long-tail knowledge, while the expanded dataset introduces new knowledge, training on new data may naturally interfere with long-tail knowledge retention. This outcome is not surprising.
- Figure 3 is confus

Code & Models

Models

🤗 eve-esa/EVE-Instruct · model · 685 dl · ♡ 3
