KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks

Kaijing Ma; Xinrun Du; Yunran Wang; Haoran Zhang; Zhoufutu Wen; Xingwei Qu; Jian Yang; Jiaheng Liu; Minghao Liu; Xiang Yue; Wenhao Huang; Ge Zhang

arXiv:2410.06526 · cs.DB · March 4, 2025

TL;DR

KOR-Bench introduces a new benchmark for evaluating language models' reasoning abilities across diverse, knowledge-orthogonal tasks, emphasizing rule application and out-of-distribution performance.

Contribution

The paper proposes the KOR-Bench benchmark, focusing on knowledge-orthogonal reasoning tasks, and demonstrates its effectiveness through new model evaluations and detailed analyses.

Findings

01. O1-Preview and O1-Mini reach accuracies of 72.88% and 70.16%, ahead of Claude-3.5-Sonnet and GPT-4o (58.96% and 58.00%).

02. Stepwise Prompting exposes bottlenecks in the Cipher task, and two rounds of Self-Correction yield the best results.

03. KOR-Bench provides insights into reasoning bottlenecks and model capabilities.

Abstract

In this paper, we introduce Knowledge-Orthogonal Reasoning (KOR), a concept aimed at minimizing reliance on domain-specific knowledge, enabling more accurate evaluation of models' reasoning abilities in out-of-distribution settings. Based on this concept, we propose the Knowledge-Orthogonal Reasoning Benchmark (KOR-Bench), encompassing five task categories: Operation, Logic, Cipher, Puzzle, and Counterfactual. KOR-Bench emphasizes models' effectiveness in applying new rule descriptions to solve novel rule-driven questions. O1-Preview and O1-Mini achieve accuracies of 72.88% and 70.16%, surpassing Claude-3.5-Sonnet and GPT-4o (58.96% and 58.00%), highlighting the effectiveness of KOR-Bench. We perform detailed analyses, identifying bottlenecks in the Cipher task with Stepwise Prompting, where two rounds of Self-Correction yield optimal results. We evaluate performance across three…

Videos

KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques