A Comprehensive Evaluation of LLM Reasoning: From Single-Model to Multi-Agent Paradigms

Yapeng Li; Jiakuo Yu; Zhixin Liu; Xinnan Liu; Jing Yu; Songze Li; Tonghua Su

arXiv:2601.13243·cs.LG·January 21, 2026

Open Access

TL;DR

This paper evaluates reasoning paradigms for Large Language Models in a unified setting, comparing their performance and cost-accuracy trade-offs, and introduces a new benchmark for semantic reasoning capabilities.

Contribution

It offers a unified analysis of direct single-model generation, CoT-augmented single-model reasoning, and multi-agent workflows, and introduces MIMeBench, a benchmark for assessing semantic abstraction and contrastive discrimination.

Findings

01. Increased structural complexity does not always improve reasoning performance.

02. Multi-agent systems can offer favorable cost-accuracy trade-offs.

03. Semantic capabilities are crucial for comprehensive reasoning assessment.

Abstract

Large Language Models (LLMs) are increasingly deployed as reasoning systems, where reasoning paradigms such as Chain-of-Thought (CoT) and multi-agent systems (MAS) play a critical role, yet their relative effectiveness and cost-accuracy trade-offs remain poorly understood. In this work, we conduct a comprehensive and unified evaluation of reasoning paradigms, spanning direct single-model generation, CoT-augmented single-model reasoning, and representative MAS workflows, characterizing their reasoning performance across a diverse suite of closed-form benchmarks. Beyond overall performance, we probe role-specific capability demands in MAS using targeted role isolation analyses, and analyze cost-accuracy trade-offs to identify which MAS workflows offer a favorable balance between cost and accuracy, and which incur prohibitive overhead for marginal gains. We further introduce MIMeBench,…
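One common way to make a cost-accuracy comparison like the one described in the abstract concrete is a Pareto-front filter: a paradigm is kept only if no other paradigm is both cheaper and more accurate. The sketch below is a minimal illustration under stated assumptions, not the authors' method. The `Run` type, the `pareto_front` helper, the paradigm labels, the cost unit, and every number are hypothetical placeholders and are not results from the paper.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    name: str        # paradigm label (hypothetical)
    cost: float      # cost per query; the unit (tokens, dollars) is an assumption
    accuracy: float  # fraction of benchmark items answered correctly, in [0, 1]


def pareto_front(runs: List[Run]) -> List[Run]:
    """Keep runs that no cheaper-or-equal-cost run matches or beats on accuracy."""
    front: List[Run] = []
    best_accuracy = float("-inf")
    # Visit runs from cheapest to most expensive; at equal cost, most accurate first.
    for run in sorted(runs, key=lambda r: (r.cost, -r.accuracy)):
        # A run survives only if it is strictly more accurate than everything
        # cheaper (or equally cheap) seen so far; exact duplicates are dropped.
        if run.accuracy > best_accuracy:
            front.append(run)
            best_accuracy = run.accuracy
    return front


# Placeholder numbers for illustration only; NOT taken from the paper.
runs = [
    Run("direct", 1.0, 0.60),
    Run("cot", 3.0, 0.70),
    Run("mas_a", 8.0, 0.75),
    Run("mas_b", 20.0, 0.74),
]

front_names = [r.name for r in pareto_front(runs)]
print(front_names)
```

With these placeholder values, `mas_b` is filtered out because `mas_a` is both cheaper and more accurate, which is the "prohibitive overhead for marginal gains" pattern the abstract refers to. Whether a point on the front is worth its extra cost remains a separate judgment that the filter does not make.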

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare