Evaluating the Systematic Reasoning Abilities of Large Language Models through Graph Coloring
Alex Heyman, Joel Zylberberg

TL;DR
This paper assesses large language models' systematic reasoning abilities using graph coloring problems, revealing progress and limitations in their problem-solving accuracy and reliability across different problem complexities.
Contribution
It introduces a novel benchmarking approach using graph coloring to evaluate LLM reasoning, highlighting the models' strengths and weaknesses in structured problem-solving.
Findings
All models except o1-mini and DeepSeek-R1 exhibit >60% error on difficult problem types in all frames.
No model achieves perfect accuracy on simple 2-coloring problems.
Semantic problem framing has substantial but varying effects on model performance.
Abstract
Contemporary large language models are powerful problem-solving tools, but they exhibit weaknesses in their reasoning abilities which ongoing research seeks to mitigate. We investigate graph coloring as a means of evaluating an LLM's capacities for systematic step-by-step reasoning and possibility space exploration, as well as effects of semantic problem framing. We test Claude 3.5 Sonnet, Llama 3.1 405B, Gemini 1.5 Pro, GPT-4o, o1-mini, and DeepSeek-R1 on a dataset of k-coloring problems spanning a range of color counts k and vertex counts, using partial algorithmic solvers to further categorize problems by difficulty. In addition to substantial but varying framing effects, we find that all models except o1-mini and R1 exhibit >60% error rates on difficult problem types in all frames (o1-mini and R1 show lower error rates), and no model achieves perfect accuracy even in the…
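To make the task concrete, the following is a minimal sketch of how a k-coloring answer could be checked and how colorability could be decided by exhaustive search on small graphs. It is an illustration only, not the authors' dataset generator or their partial algorithmic solvers; the function names `is_valid_coloring` and `find_coloring` and the two example graphs are hypothetical.

```python
from itertools import product


def is_valid_coloring(edges, coloring):
    """Return True if no edge joins two vertices of the same color."""
    return all(coloring[u] != coloring[v] for u, v in edges)


def find_coloring(num_vertices, edges, k):
    """Brute-force search over all k^n assignments.

    Returns a dict mapping vertex -> color for the first valid
    assignment found, or None if the graph is not k-colorable.
    Only practical for small graphs.
    """
    for assignment in product(range(k), repeat=num_vertices):
        coloring = dict(enumerate(assignment))
        if is_valid_coloring(edges, coloring):
            return coloring
    return None


# Hypothetical 4-vertex examples (not taken from the paper's dataset).
square = [(0, 1), (1, 2), (2, 3), (3, 0)]         # 4-cycle: 2-colorable
triangle_plus = [(0, 1), (1, 2), (2, 0), (2, 3)]  # contains a triangle: needs 3 colors

print(find_coloring(4, square, 2))
print(find_coloring(4, triangle_plus, 2))
print(find_coloring(4, triangle_plus, 3))
```

A checker like `is_valid_coloring` is enough to score a proposed coloring, while the exhaustive search supplies ground truth for whether a "not colorable" answer is correct. Because the search space grows as k^n, this approach only suits small instances.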
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: LLaMA
