General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks

Junlin Liu; Shengnan An; Shuang Zhou; Dan Ma; Shixiong Luo; Ying Xie; Yuan Zhang; Wenling Yuan; Yifan Zhou; Xiaoyu Li; Ziwen Wang; Xuezhi Cao; Xunliang Cai

arXiv:2604.11778 · cs.CL · April 14, 2026

TL;DR

General365 is a new benchmark that evaluates how well large language models reason across diverse, challenging tasks that require only limited background knowledge. Its results indicate that current models' reasoning skills are domain-dependent.

Contribution

The paper introduces General365, a comprehensive benchmark with 365 seed and 1,095 variant problems across eight categories to assess general reasoning in LLMs.
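The problem counts above imply some simple arithmetic, sketched here in Python. The two counts come from the summary; the function name `benchmark_size` and the dictionary keys are illustrative assumptions. An even division shows only that the benchmark averages a whole number of variants per seed, not that every seed has exactly that many.

```python
# Counts taken from the contribution summary above.
SEED_PROBLEMS = 365
VARIANT_PROBLEMS = 1095


def benchmark_size(seeds: int, variants: int) -> dict:
    """Summarize benchmark size (hypothetical helper, not from the paper)."""
    # divmod gives the whole number of variants per seed and any leftover.
    variants_per_seed, remainder = divmod(variants, seeds)
    return {
        "total_problems": seeds + variants,
        "variants_per_seed": variants_per_seed,
        "evenly_divisible": remainder == 0,
    }


summary = benchmark_size(SEED_PROBLEMS, VARIANT_PROBLEMS)
print(summary)
```

Under these counts the benchmark holds 1,460 problems in total, an average of three variants for each seed problem.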

Findings

01. Top LLMs achieve only 62.8% accuracy on General365.

02. Current LLM reasoning skills are heavily domain-dependent.

03. Significant room for improvement remains in general reasoning abilities.

Abstract

Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and physics. However, their ability to generalize these reasoning skills to more general and broader contexts--often termed general reasoning--remains under-explored. Unlike domain-specific reasoning, general reasoning relies less on expert knowledge but still presents formidable reasoning challenges, such as complex constraints, nested logical branches, and semantic interference. To address this gap, we introduce General365, a benchmark specifically designed to assess general reasoning in LLMs. By restricting background knowledge to a K-12 level, General365 explicitly decouples reasoning from specialized expertise. The benchmark comprises 365 seed problems and 1,095 variant problems across eight categories, ensuring both high difficulty and…
