AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability

Siwei Yang; Bingchen Zhao; Cihang Xie

arXiv:2402.09404 · cs.CL · June 23, 2025 · 2 cites

TL;DR

AQA-Bench is an interactive benchmark designed to evaluate large language models' ability to perform sequential reasoning in algorithmic tasks, revealing insights into model strengths, weaknesses, and the effects of different evaluation strategies.

Contribution

The paper introduces AQA-Bench, a novel interactive benchmark with multiple algorithms to assess LLMs' sequential reasoning, and provides comprehensive analysis of model performance and influencing factors.

Findings

01. Strong models like GPT-4 outperform open-source models.

02. Naive in-context examples can reduce performance in interactive settings.

03. Supplying a limited number of predecessor steps from the current test case can improve small models' performance.

Abstract

This paper introduces AQA-Bench, a novel benchmark to assess the sequential reasoning capabilities of large language models (LLMs) in algorithmic contexts, such as depth-first search (DFS). The key feature of the benchmark is its interactive evaluation protocol. In DFS, for example, a node's connected edges become available only once the model has traversed to that node, so the LLM must remember visited nodes and plan subsequent moves in light of possible future environmental feedback. We build AQA-Bench with three algorithms, namely binary search, depth-first search, and breadth-first search, and use it to evaluate the sequential reasoning ability of 14 different LLMs. Our investigations reveal several interesting findings: (1) Closed-source models like GPT-4 and Gemini…
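The interactive protocol described in the abstract can be illustrated with a small sketch. The code below is not the benchmark's actual implementation: the class `InteractiveGraphEnv`, its methods, and the function `reference_dfs_policy` are hypothetical names introduced here for illustration. It assumes a connected undirected graph, reveals a node's neighbors only after the agent has moved to that node, and uses a textbook DFS in place of the LLM.

```python
from typing import Dict, List, Set


class InteractiveGraphEnv:
    """Hypothetical environment: a node's neighbors are revealed only on visit."""

    def __init__(self, adjacency: Dict[int, List[int]], start: int) -> None:
        self._adjacency = adjacency  # full graph, hidden from the agent
        self.current = start
        self.visited: Set[int] = {start}

    def observe(self) -> List[int]:
        """Return the neighbors of the current node only."""
        return sorted(self._adjacency[self.current])

    def move(self, node: int) -> List[int]:
        """Step along an edge from the current node and reveal the new neighbors."""
        if node not in self._adjacency[self.current]:
            raise ValueError(f"{node} is not adjacent to {self.current}")
        self.current = node
        self.visited.add(node)
        return self.observe()

    def done(self) -> bool:
        """True once every node in the graph has been visited."""
        return self.visited == set(self._adjacency)


def reference_dfs_policy(env: InteractiveGraphEnv) -> List[int]:
    """Stand-in for the model: DFS that uses only feedback received so far.

    Assumes the graph is connected; otherwise backtracking would exhaust the stack.
    """
    path = [env.current]   # every node the agent has stood on, in order
    stack = [env.current]  # current DFS branch, used for backtracking
    seen = {env.current}   # the agent's own memory of visited nodes
    neighbors = env.observe()
    while not env.done():
        unvisited = [n for n in neighbors if n not in seen]
        if unvisited:
            nxt = unvisited[0]  # go deeper along the first unexplored edge
            stack.append(nxt)
            seen.add(nxt)
        else:
            stack.pop()         # dead end: backtrack to the parent node
            nxt = stack[-1]
        neighbors = env.move(nxt)
        path.append(nxt)
    return path
```

Calling `reference_dfs_policy(InteractiveGraphEnv({0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}, start=0))` walks the graph one edge at a time. The agent must keep `seen` itself, because the environment never reports which nodes were already visited, which mirrors the memory requirement the abstract describes.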

Code & Models

Repositories

ucsc-vlaa/aqa-bench
none · Official

Taxonomy

Topics: Open Education and E-Learning · Intelligent Tutoring Systems and Adaptive Learning · Big Data and Business Intelligence

Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Dropout · Linear Layer · Dense Connections · Label Smoothing · Absolute Position Encodings · Softmax · Byte Pair Encoding · Multi-Head Attention