Reasoning on a Spectrum: Aligning LLMs to System 1 and System 2 Thinking
Alireza S. Ziabari, Nona Ghazizadeh, Zhivar Sourati, Farzan Karimi-Malekabadi, Payam Piray, Morteza Dehghani

TL;DR
This paper explores aligning large language models with human-like reasoning styles, System 1 and System 2, to improve their performance and adaptability across various reasoning tasks.
Contribution
It introduces a method to explicitly align LLMs with System 1 and System 2 reasoning styles and demonstrates a dynamic combination approach for better performance.
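The review excerpts on this page name DPO as one of the alignment methods, so the alignment step can be illustrated with the standard DPO objective for a single preference pair. This is only a sketch under assumptions: the function name `dpo_loss`, the default `beta`, and the reading of "chosen" as the response in the target reasoning style (for example the System 1 answer) and "rejected" as the other style are illustrative and are not taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed log-probabilities of each full response under the
    policy being trained and under the frozen reference model.
    Assumption: "chosen" is the answer in the target reasoning style.
    """
    # Scaled difference of the implicit rewards (policy vs. reference log-ratios).
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # When the policy equals the reference, the loss is log(2).
    print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
    # Raising the chosen response's likelihood relative to the reference lowers the loss.
    print(dpo_loss(-1.0, -9.0, -5.0, -5.0))
```

Swapping which style counts as "chosen" would, under the same assumptions, push the model toward the other end of the reasoning spectrum.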
Findings
System 2 models excel in arithmetic and symbolic reasoning.
System 1 models perform better in commonsense reasoning.
Combining models based on entropy improves overall benchmark performance.
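The last finding refers to combining the two aligned models based on entropy, but this page does not state the exact rule. One plausible reading, sketched below purely as an assumption, is to answer with whichever model is less uncertain, measured by the mean Shannon entropy of its next-token distributions. The helper names (`token_entropy`, `mean_entropy`, `pick_answer`) and the tie-breaking choice are hypothetical.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions: Sequence[Sequence[float]]) -> float:
    """Average per-token entropy over all tokens of a generated answer."""
    if not distributions:
        raise ValueError("need at least one token distribution")
    return sum(token_entropy(d) for d in distributions) / len(distributions)


def pick_answer(sys1_answer: str,
                sys1_dists: Sequence[Sequence[float]],
                sys2_answer: str,
                sys2_dists: Sequence[Sequence[float]]) -> str:
    """Return the answer from the model with lower mean entropy.

    Assumption: lower entropy means higher confidence; ties go to System 1.
    """
    h1 = mean_entropy(sys1_dists)
    h2 = mean_entropy(sys2_dists)
    return sys1_answer if h1 <= h2 else sys2_answer


if __name__ == "__main__":
    # Toy distributions over a two-token vocabulary.
    confident = [[0.9, 0.1], [0.95, 0.05]]
    uncertain = [[0.5, 0.5], [0.6, 0.4]]
    print(pick_answer("System 1 answer", confident, "System 2 answer", uncertain))
```

In practice the distributions would come from softmaxed model logits, and the selection rule could equally be a threshold on a single model's entropy rather than a comparison between two.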
Abstract
Large Language Models (LLMs) exhibit impressive reasoning abilities, yet their reliance on structured step-by-step processing reveals a critical limitation. In contrast, human cognition fluidly adapts between intuitive, heuristic (System 1) and analytical, deliberative (System 2) reasoning depending on the context. This difference between human cognitive flexibility and LLMs' reliance on a single reasoning style raises a critical question: while human fast heuristic reasoning evolved for its efficiency and adaptability, is a uniform reasoning approach truly optimal for LLMs, or does its inflexibility make them brittle and unreliable when faced with tasks demanding more agile, intuitive responses? To answer these questions, we explicitly align LLMs to these reasoning styles by curating a dataset with valid System 1 and System 2 answers, and evaluate their performance across reasoning…
Peer Reviews
Decision·Submitted to NeurIPS 2025
Strengths:
* This paper presents an interesting perspective by linking problem types to different reasoning styles (System 1 and System 2).
* The experimental results effectively support the proposed concept.
* Extensive analyses are provided, offering intuitive insights into model behavior.
Weaknesses:
* The connection between the benchmark categories and the corresponding reasoning styles (System 1 vs. System 2), as claimed by the author, appears insufficiently substantiated and requires furt
Strengths:
- **Quality**: The experimental design is rigorous, well-controlled, and grounded in both machine learning methodology and cognitive science theory. The authors take great care to control for confounding factors such as response length and structure.
- **Clarity**: The paper is well-written and clearly structured. Key concepts, such as dual-process theory, preference optimization, and reasoning trade-offs, are introduced with sufficient background for both NLP and cognitive science aud
**Strengths:**
- The submission looks at accuracy in more than just surface-level ways. By using interpolation to look at the reasoning spectrum, the trade-offs in token efficiency, and the study of model uncertainty (through logits, hedge words, and response definitiveness), we can learn a lot about how the alignment process changes behavior.
- The submission tests 13 different benchmarks using a variety of models, such as Llama-3 and Mistral-7B, as well as alignment methods like DPO and SimP
Taxonomy
Topics: Artificial Intelligence in Law
