Memorize Theorems, Not Instances: Probing SFT Generalization through Mathematical Reasoning

Ruiying Peng; Mengyu Yang; Jing Lei; Xiaohui Li; Xueyu Wu; Xinlei Chen

arXiv:2605.09270 · cs.LG · May 12, 2026

TL;DR

This paper introduces Theorem-SFT, a fine-tuning method that improves reasoning generalization by supervising theorem application rather than surface answer patterns, with reported gains of +8.8% on MATH and +20.27% on GeoQA.

Contribution

Theorem-SFT reorients supervision toward explicit theorem application, reducing reliance on spurious surface correlations and improving reasoning generalization across model families and benchmarks.

Findings

01  +8.8% on the MATH benchmark with Theorem-SFT (LLaMA3.2-3B-Instruct)

02  +20.27% on the GeoQA benchmark with Theorem-SFT (Qwen2.5-VL-7B-Instruct)

03  Fine-tuning MLP layers alone matches the performance of fine-tuning all layers

Abstract

Supervised Fine-Tuning (SFT) is widely used for task-specific adaptation, yet recent work shows that it systematically undermines reasoning generalization. We argue that the root cause is not memorization itself but its target: vanilla SFT drives models to exploit and memorize spurious surface correlations in problem-solution pairs, leaving them brittle to superficial input variations. To address this, we propose Theorem-SFT, which reorients supervision toward explicit theorem application by teaching models how rules are invoked rather than what answers look like. Theorem-SFT yields consistent gains across benchmarks and model families: +8.8% on MATH (LLaMA3.2-3B-Instruct) and +20.27% on GeoQA (Qwen2.5-VL-7B-Instruct) without modality-specific re-training. Fine-tuning MLP layers alone matches the performance of fine-tuning all layers, implicating feed-forward components as the primary locus of reasoning rules.…
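
The MLP-only result can be made concrete with a small sketch of how one might select feed-forward parameters for training while freezing everything else. This is an illustration under stated assumptions, not the authors' implementation: the parameter names below are hypothetical, modeled on naming commonly seen in decoder-only transformer checkpoints, and the rule "a parameter is feed-forward if its path contains an `mlp` component" is an assumption that may not hold for every architecture.

```python
import re

# Hypothetical parameter names for illustration only; they are not taken
# from the paper and may differ from any specific checkpoint.
PARAM_NAMES = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.o_proj.weight",
    "model.layers.0.mlp.gate_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.0.input_layernorm.weight",
    "lm_head.weight",
]

# Assumption: feed-forward parameters carry "mlp" as a full path component.
MLP_PATTERN = re.compile(r"(^|\.)mlp(\.|$)")


def trainable_mask(names, pattern=MLP_PATTERN):
    """Map each parameter name to True if it should receive gradient updates."""
    return {name: bool(pattern.search(name)) for name in names}


mask = trainable_mask(PARAM_NAMES)
trainable = [name for name, keep in mask.items() if keep]
frozen = [name for name, keep in mask.items() if not keep]

print("trainable:", trainable)
print("frozen:", frozen)
```

In a PyTorch training setup, a mask like this could be applied by iterating over `model.named_parameters()` and setting `param.requires_grad = mask[name]` before constructing the optimizer, so that only the selected feed-forward weights are updated.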
