AcademicEval: Live Long-Context LLM Benchmark

Haozhen Zhang; Tao Feng; Pengrui Han; Jiaxuan You

arXiv:2510.17725·cs.CL·October 21, 2025

Open Access · 3 Reviews

TL;DR

AcademicEval is a live, flexible benchmark using arXiv papers to evaluate LLMs on long-context academic writing tasks without manual labeling, revealing current models' struggles with hierarchical abstraction and long demonstrations.

Contribution

It introduces a novel live benchmark for long-context evaluation using real academic papers and expert-curated demonstrations, avoiding label leakage and manual annotation.

Findings

01

LLMs perform poorly on hierarchical abstraction tasks.

02

Models struggle with long few-shot demonstrations.

03

Benchmark reveals significant challenges in long-context modeling.

Abstract

Large Language Models (LLMs) have recently achieved remarkable performance in long-context understanding. However, current long-context LLM benchmarks are limited by rigid context length, labor-intensive annotation, and the pressing challenge of label leakage issues during LLM training. Therefore, we propose AcademicEval, a live benchmark for evaluating LLMs over long-context generation tasks. AcademicEval adopts papers on arXiv to introduce several academic writing tasks with long-context inputs, i.e., Title, Abstract, Introduction, and Related Work, which cover a wide range of abstraction levels and require no manual labeling. Moreover, AcademicEval integrates high-quality and expert-curated few-shot demonstrations from a collected co-author graph to enable flexible context length. Especially,…

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 4

Strengths

This paper proposes KV-Distill, a model that is trained as an adaptor for pre-trained LLMs. KV-Distill employs a novel student-teacher type method.

Weaknesses

1. The experiments are conducted only on QA and summary-type tasks. Other tasks in the LongBench set, such as code, are missing.
2. Key baselines, such as SnapKV, LLMLingua, Semantic Compression, D2O, and KVMerger, are missing.
3. The summaries of the short story “Galactic Ghost” take up a lot of space and could be moved to the appendix.
4. The paper mentions employing “semantic compression”, but key references are missing.

Reviewer 02 · Rating 3 · Confidence 2

Strengths

1 - By using arXiv papers, ACADEMICEVAL leverages readily available, high-quality academic content, which reduces reliance on labor-intensive manual annotation.
2 - The benchmark’s periodic updates from arXiv mitigate risks associated with label leakage and maintain relevance in LLM assessment.
3 - The benchmark includes multiple academic writing tasks with different abstraction levels, offering a broad evaluation framework for long-context generation.

Weaknesses

1 - The proposed benchmark does not introduce any fundamentally novel insights or methodological contributions for evaluating long-context LLMs, instead reusing existing concepts (e.g., hierarchical task structure and few-shot learning demonstrations).
2 - ACADEMICEVAL focuses on a narrow set of tasks related to academic writing, which limits its applicability and fails to test LLMs across a wider range of real-world long-context scenarios.
3 - The paper does not adequately demonstrate the lim

Reviewer 03 · Rating 3 · Confidence 3

Strengths

- The motivation of testing LLMs' ability to perform long-context text-generation tasks is great.
- Writing different sections of a paper is a nice challenge for this kind of task.
- Having a benchmark that requires no human labor is beneficial.
- The idea of using live evaluation to avoid label leakage is decent.

Weaknesses

- While the benchmark is about long-context text generation, focusing on generating four sections of an arXiv paper makes the benchmark not comprehensive enough for evaluating how well an LLM performs in the grandiose task of long-context text generation. There are so many other long-context text-generation tasks that require different types of reasoning and setup. The title might need to be qualified to something like "evaluating llms on a subset of research paper generation". How do you ensure this

Taxonomy

Topics: Topic Modeling · Text Readability and Simplification · Artificial Intelligence in Healthcare and Education