ResearchArena: Benchmarking Large Language Models' Ability to Collect and Organize Information as Research Agents

Hao Kang; Chenyan Xiong

arXiv:2406.10291 · cs.AI · September 9, 2025

TL;DR

ResearchArena is a benchmark designed to evaluate large language models' ability to conduct academic surveys by simulating the research process in information discovery, selection, and organization, highlighting current limitations and future opportunities.

Contribution

This paper introduces ResearchArena, a novel benchmark for assessing LLMs' research capabilities, including a comprehensive offline environment and evaluation framework for academic survey tasks.

Findings

01. LLMs underperform compared to keyword-based retrieval methods

02. Recent reasoning models like DeepSeek-R1 show improved zero-shot performance

03. Significant opportunities exist for advancing LLMs in autonomous research

Abstract

Large language models (LLMs) excel across many natural language processing tasks but face challenges in domain-specific, analytical tasks such as conducting research surveys. This study introduces ResearchArena, a benchmark designed to evaluate LLMs' capabilities in conducting academic surveys -- a foundational step in academic research. ResearchArena models the process in three stages: (1) information discovery, identifying relevant literature; (2) information selection, evaluating papers' relevance and impact; and (3) information organization, structuring knowledge into hierarchical frameworks such as mind-maps. Notably, mind-map construction is treated as a bonus task, reflecting its supplementary role in survey-writing. To support these evaluations, we construct an offline environment of 12M full-text academic papers and 7.9K survey papers. To ensure ethical compliance, we do not…
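The three stages named in the abstract can be pictured with a toy pipeline. The following is only a minimal sketch under loud assumptions: the `Paper` fields, the keyword-overlap scoring, the use of citation counts as a proxy for impact, and the grouping by an `area` label are all hypothetical choices made for illustration, not the benchmark's actual data format, baselines, or metrics.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Paper:
    # All fields are hypothetical; the benchmark's real schema is not given here.
    title: str
    abstract: str
    citations: int  # assumed stand-in for "impact"
    area: str       # assumed coarse label used to build the hierarchy


@dataclass
class MindMapNode:
    label: str
    children: list[MindMapNode] = field(default_factory=list)


def _overlap(topic: str, paper: Paper) -> int:
    """Count topic words that also occur in the paper's title or abstract."""
    terms = set(topic.lower().split())
    words = set((paper.title + " " + paper.abstract).lower().split())
    return len(terms & words)


def discover(topic: str, corpus: list[Paper]) -> list[Paper]:
    """Stage 1 (discovery): keep papers sharing at least one word with the topic."""
    return [p for p in corpus if _overlap(topic, p) > 0]


def select(topic: str, candidates: list[Paper], k: int) -> list[Paper]:
    """Stage 2 (selection): rank by keyword overlap, then citations; keep top k."""
    ranked = sorted(
        candidates,
        key=lambda p: (_overlap(topic, p), p.citations),
        reverse=True,
    )
    return ranked[:k]


def organize(topic: str, selected: list[Paper]) -> MindMapNode:
    """Stage 3 (organization): build a two-level tree, topic -> area -> paper."""
    by_area: dict[str, MindMapNode] = {}
    for p in selected:
        node = by_area.setdefault(p.area, MindMapNode(p.area))
        node.children.append(MindMapNode(p.title))
    return MindMapNode(topic, list(by_area.values()))
```

Calling `organize(topic, select(topic, discover(topic, corpus), k))` chains the stages; in a real agent each stage would be replaced by a retrieval system or an LLM call, which is the part the benchmark evaluates.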

Taxonomy

Topics: Legal Education and Practice Innovations · Artificial Intelligence in Law