AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability
Siwei Yang, Bingchen Zhao, Cihang Xie

TL;DR
AQA-Bench is an interactive benchmark designed to evaluate large language models' ability to perform sequential reasoning in algorithmic tasks, revealing insights into model strengths, weaknesses, and the effects of different evaluation strategies.
Contribution
The paper introduces AQA-Bench, a novel interactive benchmark with multiple algorithms to assess LLMs' sequential reasoning, and provides comprehensive analysis of model performance and influencing factors.
Findings
Closed-source models such as GPT-4 outperform open-source models.
Naive in-context examples can reduce performance in interactive settings.
Limited predecessor steps can improve small models' performance.
Abstract
This paper introduces AQA-Bench, a novel benchmark to assess the sequential reasoning capabilities of large language models (LLMs) in algorithmic contexts, such as depth-first search (DFS). The key feature of our evaluation benchmark lies in its interactive evaluation protocol: for example, in DFS, the availability of each node's connected edges is contingent upon the model's traversal to that node, thereby necessitating the LLM's ability to effectively remember visited nodes and strategize subsequent moves considering the possible environmental feedback in future steps. We comprehensively build AQA-Bench with three different algorithms, namely binary search, depth-first search, and breadth-first search, and evaluate the sequential reasoning ability of 14 different LLMs. Our investigations reveal several interesting findings: (1) Closed-source models like GPT-4 and Gemini…
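The interactive protocol described for DFS can be illustrated with a minimal sketch. This is not the benchmark's actual code: the class `InteractiveDFSEnv`, its methods, and the baseline `reference_dfs_policy` are hypothetical names introduced here, and the graph is assumed to be undirected so that backtracking along an edge is always a legal move. The point is only that the agent sees a node's neighbours after moving to that node, so it has to track visited nodes itself.

```python
from typing import Dict, List, Set


class InteractiveDFSEnv:
    """Hypothetical environment: a node's neighbours are revealed only while standing on it."""

    def __init__(self, graph: Dict[int, List[int]], start: int) -> None:
        self.graph = graph          # assumed undirected adjacency lists
        self.current = start
        self.visited: Set[int] = {start}
        self.steps = 0

    def observe(self) -> List[int]:
        # Feedback is limited to the neighbours of the current node.
        return sorted(self.graph[self.current])

    def move(self, node: int) -> List[int]:
        # Only moves along an edge of the current node are legal.
        if node not in self.graph[self.current]:
            raise ValueError(f"{node} is not adjacent to {self.current}")
        self.current = node
        self.visited.add(node)
        self.steps += 1
        return self.observe()

    def done(self) -> bool:
        return len(self.visited) == len(self.graph)


def reference_dfs_policy(env: InteractiveDFSEnv) -> List[int]:
    """Baseline agent: remembers visited nodes and backtracks with an explicit stack."""
    path = [env.current]
    stack = [env.current]
    seen = {env.current}
    neighbours = env.observe()
    while not env.done():
        unvisited = [n for n in neighbours if n not in seen]
        if unvisited:
            nxt = unvisited[0]      # go deeper
            seen.add(nxt)
            stack.append(nxt)
        else:
            stack.pop()             # dead end: backtrack to the parent
            if not stack:
                break               # remaining nodes are unreachable
            nxt = stack[-1]
        neighbours = env.move(nxt)
        path.append(nxt)
    return path
```

In an evaluation loop, an LLM would take the place of `reference_dfs_policy`, receiving each `observe()` result as text and replying with the next node; the number of moves it needs could then be compared against this baseline.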
Taxonomy
Topics: Open Education and E-Learning · Intelligent Tutoring Systems and Adaptive Learning · Big Data and Business Intelligence
Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Dropout · Linear Layer · Dense Connections · Label Smoothing · Absolute Position Encodings · Softmax · Byte Pair Encoding · Multi-Head Attention
