From Meta-Thought to Execution: Cognitively Aligned Post-Training for Generalizable and Reliable LLM Reasoning
Shaojie Wang, Liang Zhang

TL;DR
This paper introduces a cognitively inspired post-training framework for large language models that separates abstract reasoning strategies from problem-specific execution, improving generalization, reliability, and training efficiency.
Contribution
It proposes Chain-of-Meta-Thought and Confidence-Calibrated Reinforcement Learning to better align model training with human problem-solving cognition.
Findings
Achieves 2.19% and 4.63% improvements on benchmarks
Reduces training time by 65-70%
Cuts token consumption by 50%
Abstract
Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based Reinforcement Learning (RL). While effective, a closer examination reveals a fundamental gap: this approach does not align with how humans actually solve problems. Human cognition naturally decomposes problem-solving into two distinct stages: first acquiring abstract strategies (i.e., meta-knowledge) that generalize across problems, then adapting them to specific instances. In contrast, by treating complete trajectories as basic units, current methods are inherently problem-centric, entangling abstract strategies with problem-specific execution. To address this misalignment, we propose a cognitively-inspired framework that explicitly mirrors the two-stage human cognitive process. Specifically, Chain-of-Meta-Thought (CoMT) focuses supervised learning…
Taxonomy
Topics: Reinforcement Learning in Robotics · AI-based Problem Solving and Planning · Topic Modeling
