Can Large Language Models Generalize Procedures Across Representations?

Fangru Lin; Valentin Hofmann; Xingchen Wan; Weixing Wang; Zifeng Ding; Anthony G. Cohn; Janet B. Pierrehumbert

arXiv:2602.03542·cs.CL·February 4, 2026

Open Access

TL;DR

This paper investigates whether large language models can generalize procedures across representations such as code, graphs, and natural language. It proposes a two-stage data curriculum that improves cross-representation generalization, with a 1.5B Qwen model matching zero-shot GPT-4 performance in planning.

Contribution

The paper introduces a two-stage data curriculum that trains first on symbolic data (code or graphs) and then on natural language data, improving LLMs' ability to generalize procedures across representations and narrowing the gap between symbolic and natural language tasks.

Findings

01. Training on graphs or code alone does not reliably generalize to natural language tasks.

02. A two-stage curriculum improves model performance across tasks and model sizes.

03. A 1.5B Qwen model with the curriculum matches zero-shot GPT-4 performance in planning.

Abstract

Large language models (LLMs) are trained and tested extensively on symbolic representations such as code and graphs, yet real-world user tasks are often specified in natural language. To what extent can LLMs generalize across these representations? Here, we approach this question by studying isomorphic tasks involving procedures represented in code, graphs, and natural language (e.g., scheduling steps in planning). We find that training LLMs with popular post-training methods on graphs or code data alone does not reliably generalize to corresponding natural language tasks, while training solely on natural language can lead to inefficient performance gains. To address this gap, we propose a two-stage data curriculum that first trains on symbolic, then natural language data. The curriculum substantially improves model performance across model families and tasks. Remarkably, a 1.5B Qwen…
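
To make the idea of isomorphic tasks concrete, the following is a minimal sketch of one small planning procedure expressed as a graph and as natural language, with training examples ordered symbolic first, natural language second. It is illustrative only and is not the paper's data format or training code. The toy procedure `STEPS`, the `Example` class, and every function name are assumptions introduced here.

```python
from dataclasses import dataclass

# Hypothetical toy procedure (not from the paper):
# step name -> (duration in minutes, list of prerequisite steps).
STEPS = {
    "boil water": (5, []),
    "chop vegetables": (4, []),
    "cook soup": (10, ["boil water", "chop vegetables"]),
}


def as_graph(steps):
    """Graph view: a list of (prerequisite, step) edges."""
    return [(pre, step) for step, (_, pres) in steps.items() for pre in pres]


def as_natural_language(steps):
    """Natural-language view of the same procedure."""
    sentences = []
    for step, (minutes, pres) in steps.items():
        sentence = f"'{step}' takes {minutes} minutes"
        if pres:
            sentence += " and must come after " + " and ".join(f"'{p}'" for p in pres)
        sentences.append(sentence + ".")
    return " ".join(sentences)


def shortest_time(steps):
    """Minimum total time if independent steps can run in parallel."""
    finish = {}

    def finish_time(step):
        if step not in finish:
            minutes, pres = steps[step]
            finish[step] = minutes + max((finish_time(p) for p in pres), default=0)
        return finish[step]

    return max(finish_time(step) for step in steps)


@dataclass
class Example:
    representation: str  # assumed labels: "graph", "code", "natural_language"
    prompt: str
    answer: int


def two_stage_curriculum(examples):
    """Order examples so symbolic ones come first, natural language second."""
    symbolic = [e for e in examples if e.representation != "natural_language"]
    natural = [e for e in examples if e.representation == "natural_language"]
    return symbolic + natural


answer = shortest_time(STEPS)
examples = [
    Example("natural_language", as_natural_language(STEPS), answer),
    Example("graph", str(as_graph(STEPS)), answer),
]
ordered = two_stage_curriculum(examples)
```

Both views encode the same dependency structure and therefore share one answer, which is what makes the tasks isomorphic. In this sketch the curriculum is only an ordering of examples; how each stage is actually trained is left out.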

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning in Healthcare