Agentic Frameworks for Reasoning Tasks: An Empirical Study
Zeeshan Rasheed, Abdul Malik Sami, Muhammad Waseem, Kai-Kristian Kemell, Mika Saari, Pekka Abrahamsson

TL;DR
This empirical study compares 22 agentic frameworks on three reasoning benchmarks (BBH, GSM8K, and ARC), reporting accuracy, efficiency, and practical issues such as orchestration and cost, to guide framework selection.
Contribution
It provides the first large-scale empirical comparison of agentic frameworks for reasoning tasks, emphasizing the importance of orchestration quality.
Findings
19 of 22 frameworks completed all benchmarks.
Of these, 12 showed stable performance, with mean accuracy of 74.6-75.9%, execution time of 4-6 seconds per task, and low cost.
Poor performance was mainly due to orchestration issues such as context growth and retries.
Abstract
Recent advances in agentic frameworks have enabled AI agents to perform complex reasoning and decision-making. However, evidence comparing their reasoning performance, efficiency, and practical suitability remains limited. To address this gap, we empirically evaluate 22 widely used agentic frameworks across three reasoning benchmarks: BBH, GSM8K, and ARC. The frameworks were selected from 1,200 GitHub repositories collected between January 2023 and July 2025 and organized into a taxonomy based on architectural design. We evaluated them under a unified setting, measuring reasoning accuracy, execution time, computational cost, and cross-benchmark consistency. Our results show that 19 of the 22 frameworks completed all three benchmarks. Among these, 12 showed stable performance, with mean accuracy of 74.6-75.9%, execution time of 4-6 seconds per task, and cost of 0.14-0.18 cents per…
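The four measurements named in the abstract (accuracy, execution time, cost, and cross-benchmark consistency) can be illustrated with a small aggregation sketch. This is not the authors' evaluation code. The record fields (`framework`, `benchmark`, `correct`, `seconds`, `cost_cents`) and the `summarize` function are hypothetical, and using the population standard deviation of per-benchmark accuracy as a consistency measure is an assumption, since the abstract does not define that metric.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize(records):
    """Aggregate per-task records into per-framework metrics.

    Each record is a dict with hypothetical keys:
    framework (str), benchmark (str), correct (bool),
    seconds (float), cost_cents (float).
    """
    by_framework = defaultdict(list)
    for r in records:
        by_framework[r["framework"]].append(r)

    summary = {}
    for fw, rows in by_framework.items():
        # Collect 1.0/0.0 hits per benchmark, then average them.
        hits = defaultdict(list)
        for r in rows:
            hits[r["benchmark"]].append(1.0 if r["correct"] else 0.0)
        acc = {b: mean(v) for b, v in hits.items()}

        summary[fw] = {
            "accuracy_by_benchmark": acc,
            # Macro-average: every benchmark weighs the same.
            "mean_accuracy": mean(acc.values()),
            # Assumed consistency proxy: lower spread = more consistent.
            "accuracy_spread": pstdev(acc.values()),
            "mean_seconds": mean(r["seconds"] for r in rows),
            "mean_cost_cents": mean(r["cost_cents"] for r in rows),
        }
    return summary
```

Macro-averaging over benchmarks keeps a large benchmark from dominating the mean. A task-weighted average would be an equally plausible choice.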
