On Memorization of Large Language Models in Logical Reasoning

Chulin Xie; Yangsibo Huang; Chiyuan Zhang; Da Yu; Xinyun Chen; Bill Yuchen Lin; Bo Li; Badih Ghazi; Ravi Kumar

arXiv:2410.23123 · cs.CL · March 5, 2025 · 2 cites

TL;DR

This paper investigates whether large language models rely on memorization or genuine reasoning when solving logical puzzles, and finds that they develop reasoning skills alongside memorization.

Contribution

It provides a systematic, quantitative analysis of memorization versus reasoning in LLMs on logical puzzles, combining experiments, perturbation tests, and internal probing.

Findings

01. LLMs memorize training puzzles with high accuracy after fine-tuning.

02. Fine-tuning improves generalization despite heavy memorization.

03. Models switch between reasoning and memorization depending on the puzzle variation.

Abstract

Large language models (LLMs) achieve good performance on challenging reasoning benchmarks, yet could also make basic reasoning mistakes. This contrasting behavior is puzzling when it comes to understanding the mechanisms behind LLMs' reasoning capabilities. One hypothesis is that the increasingly high and nearly saturated performance on common reasoning benchmarks could be due to the memorization of similar problems. In this paper, we systematically investigate this hypothesis with a quantitative measurement of memorization in reasoning tasks, using a dynamically generated logical reasoning benchmark based on Knights and Knaves (K&K) puzzles. We find that LLMs could interpolate and memorize the training puzzles (achieving near-perfect accuracy) after fine-tuning, yet they struggle with slight variations of these puzzles. On the other hand, we show that while fine-tuning leads to heavy…
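To make the puzzle setting concrete, here is a minimal sketch of a brute-force Knights and Knaves solver. It is not the authors' generator or solver; the function name `solve`, the encoding of statements as Python callables, and the two example puzzles are all illustrative assumptions. A knight always tells the truth and a knave always lies, so an assignment is valid when each person's role matches the truth value of what they say.

```python
from itertools import product


def solve(n_people, statements):
    """Return all role assignments consistent with every statement.

    An assignment is a tuple of bools (True = knight, False = knave).
    statements[i] is a callable (hypothetical encoding) that takes an
    assignment and returns the truth value of what person i claims.
    """
    solutions = []
    for assignment in product([True, False], repeat=n_people):
        # A knight's claim must be true; a knave's claim must be false.
        if all(assignment[i] == statements[i](assignment) for i in range(n_people)):
            solutions.append(assignment)
    return solutions


# Hypothetical puzzle: A says "B is a knave."; B says "A and I are both knights."
original = [
    lambda a: not a[1],
    lambda a: a[0] and a[1],
]

# Hypothetical variation: only B's statement changes,
# to "At least one of us is a knave."
perturbed = [
    lambda a: not a[1],
    lambda a: (not a[0]) or (not a[1]),
]

original_solutions = solve(2, original)    # [(True, False)]: A knight, B knave
perturbed_solutions = solve(2, perturbed)  # [(False, True)]: A knave, B knight
```

Changing a single statement flips the unique answer, which is why a model that has only memorized the original puzzle's answer would be expected to fail on the variation.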

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

- The paper attempts to reveal the nuanced relationship between memorization and reasoning, contributing to a deeper understanding of LLM capabilities and limitations.
- Perturbation tests offer methods to assess LLMs' reasoning abilities independently of memorization.

Weaknesses

- The definition of "memorization" is vague. Is that the opposite of "generalization"? Why do we need to create a new term (and even a new metric) compared to the traditional term in machine learning research?
- Following the question above, I think the definition of memorization score is too arbitrary and may be misleading. What's the meaning of multiplication of accuracy and CR? For example, (ACC=0.2, CR=0.2) and (ACC=0.8, CR=0.8) will produce the same score. Do these two results have the
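The arithmetic behind the reviewer's example can be checked with a short sketch. It assumes the score has the form ACC × (1 − CR), with CR read as a consistency ratio under perturbation; that form is not stated in this excerpt, but it is the reading under which the two cases coincide (a plain product ACC × CR would give 0.04 and 0.64). The function name `memorization_score` is hypothetical.

```python
import math


def memorization_score(acc, cr):
    """Assumed form of the score: accuracy times inconsistency (1 - CR)."""
    if not (0.0 <= acc <= 1.0 and 0.0 <= cr <= 1.0):
        raise ValueError("acc and cr must both lie in [0, 1]")
    return acc * (1.0 - cr)


# The two cases from the reviewer's comment.
low_case = memorization_score(0.2, 0.2)   # 0.2 * 0.8, about 0.16
high_case = memorization_score(0.8, 0.8)  # 0.8 * 0.2, about 0.16

# Compare with a tolerance because of floating-point rounding.
same_score = math.isclose(low_case, high_case)
```

Under this assumed form, a weak but inconsistent model and a strong but mostly consistent model receive the same value, which is the ambiguity the reviewer is pointing at.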

Reviewer 02 · Rating 5 · Confidence 4

Strengths

- Overall, I believe this paper is well written and easy to understand.
- The authors have explained their assumptions and experimental setup clearly.
- The main figure conveys most of the experimental gist at a glance.

Weaknesses

Limitations

- State space with perturbations: The problem space is limited in number of people, depth, and width (at most 8, 2, and 2, respectively). These limited dimensions make it relatively easy for models to interpolate the entire problem space with perturbations, potentially inflating perceived generalisation.
- Limited evaluation: The authors analyze only 8 models, yet they refer to the work as a benchmark, which weakens that claim. A more comprehensive e

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. Novel Quantification of Memorization: The paper introduces a new metric, the Local Inconsistency-based Memorization Score (LiMem), which provides a structured way to measure the extent of memorization versus reasoning in large language models (LLMs), a valuable contribution to understanding model behavior.
2. Thorough Empirical Analysis: The paper conducts an in-depth evaluation of models under various conditions, including fine-tuning, perturbation tests, and cross-difficulty transferabilit

Weaknesses

1. Limited Task Scope: The paper focuses solely on logical reasoning, particularly the Knights and Knaves puzzles. While this allows for deep analysis, it limits the generalizability of the conclusions. Experiments on other reasoning domains, such as mathematical reasoning or different types of logical reasoning, would strengthen the paper's claims and make the results more broadly applicable.
2. Lack of Surprising Results: The finding that reasoning abilities improve with memorization is not p

Videos

On Memorization of Large Language Models in Logical Reasoning · YouTube

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling