GISA: A Benchmark for General Information-Seeking Assistant

Yutao Zhu; Xingshuo Zhang; Maosen Zhang; Jiajie Jin; Liancheng Zhang; Xiaoshuai Song; Kangzhi Zhao; Wencong Zeng; Ruiming Tang; Han Li; Ji-Rong Wen; Zhicheng Dou

arXiv:2602.08543 · cs.CL · February 16, 2026

TL;DR

GISA is a new benchmark designed to evaluate general information-seeking assistants using realistic queries, structured answer formats, and process-level supervision, revealing significant performance gaps in current models.

Contribution

The paper introduces GISA, a comprehensive benchmark with human-crafted queries, diverse answer formats, and process trajectories to better evaluate and develop information-seeking LLMs.

Findings

01. Current models achieve only 19.30% exact match on GISA.

02. Performance drops on tasks requiring complex reasoning and broad information gathering.

03. GISA exposes substantial room for improvement in LLM-based search agents.

Abstract

The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gathering information through multi-turn web interactions. Various benchmarks have been proposed to evaluate such agents. However, existing benchmarks often construct queries backward from answers, producing unnatural tasks misaligned with real-world needs. Moreover, these benchmarks tend to focus on either locating specific information or aggregating information from multiple sources, while relying on static answer sets prone to data contamination. To bridge these gaps, we introduce GISA, a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries that reflect authentic information-seeking scenarios. GISA features four structured answer formats (item, set, list, and table), enabling deterministic evaluation. It integrates…
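To illustrate why structured answer formats allow deterministic scoring, here is a minimal sketch of an exact-match check over the four formats named in the abstract (item, set, list, table). This is not the benchmark's official scorer: the function names, the lowercase-and-whitespace normalization, the representation of a table as a list of row dictionaries, and the choice to ignore row order are all assumptions made for illustration.

```python
from typing import Any


def normalize(value: Any) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(str(value).lower().split())


def exact_match(fmt: str, predicted: Any, gold: Any) -> bool:
    """Hypothetical exact-match check for one answer, by format."""
    if fmt == "item":
        # A single value: compare after normalization.
        return normalize(predicted) == normalize(gold)
    if fmt == "set":
        # Unordered collection: order and duplicates are ignored.
        return {normalize(x) for x in predicted} == {normalize(x) for x in gold}
    if fmt == "list":
        # Ordered collection: position matters.
        return [normalize(x) for x in predicted] == [normalize(x) for x in gold]
    if fmt == "table":
        # Assumed layout: a list of {column: cell} rows; row order is ignored.
        def canon(rows):
            return sorted(
                tuple(sorted((normalize(k), normalize(v)) for k, v in row.items()))
                for row in rows
            )
        return canon(predicted) == canon(gold)
    raise ValueError(f"unknown answer format: {fmt}")
```

Because every branch reduces to an equality test on normalized values, the same prediction always receives the same score, with no judge model involved.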

Code & Models

Datasets

RUC-NLPIR/GISA
dataset · 2.7k dl

Taxonomy

Topics: Information Retrieval and Search Behavior · Topic Modeling · Expert Finding and Q&A Systems