Rote Learning Considered Useful: Generalizing over Memorized Data in LLMs

Qinyuan Wu; Soumi Das; Mahsa Amani; Bishwamittra Ghosh; Mohammad Aflah Khan; Krishna P. Gummadi; Muhammad Bilal Zafar

arXiv:2507.21914·cs.CL·March 3, 2026

3 Reviews

TL;DR

This paper demonstrates that large language models can generalize over rote memorized data through a two-phase framework, challenging the belief that rote learning only hinders understanding and highlighting both opportunities and risks.

Contribution

It introduces a novel 'memorize-then-generalize' framework showing LLMs can reinterpret rote memorized data via semantic prompts, enabling effective knowledge transfer.

Findings

01

Models can reinterpret rote memorized data with semantic prompts

02

Emergence of structured, semantically aligned latent representations

03

Potential for both knowledge injection and malicious data reuse

Abstract

Rote learning is a memorization technique based on repetition. Many researchers argue that rote learning hinders generalization because it encourages verbatim memorization rather than deeper understanding. This concern extends even to factual knowledge, which inevitably requires a certain degree of memorization. In this work, we challenge this view and demonstrate that large language models (LLMs) can, in fact, generalize over rote memorized data. We introduce a two-phase "memorize-then-generalize" framework, where the model first rote memorizes factual subject-object associations using a synthetic semantically meaningless key token and then learns to generalize by fine-tuning on a small set of semantically meaningful prompts. Extensive experiments over 8 LLMs show that the models can reinterpret rote memorized data through the semantically meaningful prompts, as evidenced by the…
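The two-phase data setup described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the fictional names, the `[X]` key token string, the prompt templates, and both helper functions are hypothetical, and the actual fine-tuning step is omitted.

```python
import random

# Hypothetical key token: a synthetic string with no semantic meaning.
KEY_TOKEN = "[X]"


def build_memorization_examples(facts, key_token=KEY_TOKEN):
    """Phase 1: one 'subject KEY object' string per fact, for rote memorization."""
    return [f"{subject} {key_token} {obj}" for subject, obj in facts]


def build_generalization_examples(facts, templates, k, seed=0):
    """Phase 2: pair a small random subset of facts with meaningful prompts."""
    rng = random.Random(seed)
    subset = rng.sample(facts, k)
    return [
        f"{template.format(subject=subject)} {obj}"
        for subject, obj in subset
        for template in templates
    ]


# Fictional subject-object pairs (assumed here to express a "mother of" relation).
facts = [
    ("Zorvane Quill", "Eltara Moss"),
    ("Brindle Okafe", "Sunniva Tarn"),
    ("Malthis Verrow", "Ione Castell"),
    ("Peregrin Ashdown", "Lysa Fenwick"),
]

# Hypothetical semantically meaningful prompts for the same relation.
templates = [
    "Who is the mother of {subject}? Answer:",
    "The mother of {subject} is",
]

phase1 = build_memorization_examples(facts)
phase2 = build_generalization_examples(facts, templates, k=2)

for line in phase1 + phase2:
    print(line)
```

In this sketch, phase 1 covers every fact while phase 2 covers only `k` of them, mirroring the abstract's "small set of semantically meaningful prompts"; generalization would then be measured on the facts held out of phase 2.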

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper presents a more efficient and effective method for knowledge injection than standard SFT or ICL.
2. The decision to use a fully synthetic dataset with fictional entities is a major strength. This eliminates the confounding variable of pre-existing knowledge and ensures that the model is learning the facts entirely from the training, making the findings on knowledge acquisition highly reliable.
3. The findings are shown to be consistent across 8 different LLMs from 4 model families,

Weaknesses

1. A primary goal of knowledge injection is to update or correct existing, incorrect facts stored in a model's parameters. The paper's framework is not tested in this more realistic and challenging scenario. It's unclear if the "memorize-then-generalize" method would be effective, or perhaps even detrimental, when the rote-learned fact (e.g., Paris [X] Germany) conflicts with strong pre-trained knowledge.

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The experimental design is simple and clear, enabling controlled analysis of memory and generalization dynamics.
- Results are reported across multiple models and evaluation types.

Weaknesses

- The authors' statement in the paper, "To the best of our knowledge, this is the first work to systematically show that LLMs are able to generalize from memorized data", is clearly an overclaim. Many papers that the authors cite already demonstrate phenomena where generalization emerges after extensive memorization. I think the authors need a better understanding of the existing work on how generalization emerges from memorization.
- The paper observes limited generalization

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- The paper tackles an important topic of knowledge injection and manipulation.
- The idea is simple and effective and seems to be a promising direction for light-weight knowledge injection.
- The experiments are comprehensive and convincing; the three generalization scheme setup is appropriate and the analyses are illuminating.

Weaknesses

One obstacle to rote learning becoming practical is that training on synthetic, non-semantic tokens might degrade model performance on other tasks, especially when the model is trained for 20 epochs (despite the improvement on knowledge injection). The authors already tested unrelated knowledge in the analysis. It would be good to see whether this method is non-disruptive over a wider range of domains beyond knowledge.
