General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks
Junlin Liu, Shengnan An, Shuang Zhou, Dan Ma, Shixiong Luo, Ying Xie, Yuan Zhang, Wenling Yuan, Yifan Zhou, Xiaoyu Li, Ziwen Wang, Xuezhi Cao, Xunliang Cai

TL;DR
General365 is a benchmark that evaluates large language models' general reasoning across diverse, challenging tasks requiring only K-12-level background knowledge. Its results show that current models' reasoning skills are heavily domain-dependent.
Contribution
The paper introduces General365, a comprehensive benchmark with 365 seed and 1,095 variant problems across eight categories to assess general reasoning in LLMs.
Findings
Top LLMs achieve only 62.8% accuracy on General365.
Current LLM reasoning skills are heavily domain-dependent.
Significant room for improvement remains in general reasoning abilities.
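The findings above rest on accuracy, both overall and broken down by category. As a minimal sketch of how such a breakdown could be computed, the Python below scores a list of graded answers. It is not the authors' evaluation code: the function name `accuracy_by_category`, the category labels, and the toy results are all hypothetical, and exact-match grading into a boolean is an assumption.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """Return (overall accuracy, per-category accuracy).

    `results` is an iterable of (category, is_correct) pairs.
    This record layout is an assumption made for illustration.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    per_category = {c: correct[c] / totals[c] for c in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return overall, per_category


# Made-up toy data; these are NOT General365 categories or results.
toy_results = [
    ("category_a", True), ("category_a", True),
    ("category_a", False), ("category_a", True),
    ("category_b", False), ("category_b", True),
    ("category_b", False), ("category_b", False),
]

overall, per_category = accuracy_by_category(toy_results)

# The gap between the best and worst category is one simple
# indicator of how domain-dependent a model's accuracy is.
spread = max(per_category.values()) - min(per_category.values())
```

A large spread with a moderate overall score is the pattern that a claim of domain-dependent reasoning describes: the same model does well in some categories and poorly in others.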
Abstract
Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and physics. However, their ability to generalize these reasoning skills to more general and broader contexts--often termed general reasoning--remains under-explored. Unlike domain-specific reasoning, general reasoning relies less on expert knowledge but still presents formidable reasoning challenges, such as complex constraints, nested logical branches, and semantic interference. To address this gap, we introduce General365, a benchmark specifically designed to assess general reasoning in LLMs. By restricting background knowledge to a K-12 level, General365 explicitly decouples reasoning from specialized expertise. The benchmark comprises 365 seed problems and 1,095 variant problems across eight categories, ensuring both high difficulty and…
