NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment
Wenqing Wu, Yi Zhao, Yuzhuo Wang, Siyou Li, Juexi Shao, Yunfei Long, Chengzhi Zhang

TL;DR
NovBench is a large-scale benchmark for evaluating how well large language models assess research novelty. It highlights current models' limited understanding of novelty and the need for improved fine-tuning.
Contribution
The paper introduces NovBench, the first dedicated benchmark for evaluating LLMs on scientific novelty assessment, together with a comprehensive evaluation framework.
Findings
Current LLMs show limited understanding of scientific novelty.
Fine-tuned models often struggle with instruction-following.
The benchmark reveals significant room for improvement in LLMs' novelty evaluation capabilities.
Abstract
Novelty is a core requirement in academic publishing and a central focus of peer review, yet the growing volume of submissions has placed increasing pressure on human reviewers. While large language models (LLMs), including those fine-tuned on peer review data, have shown promise in generating review comments, the absence of a dedicated benchmark has limited systematic evaluation of their ability to assess research novelty. To address this gap, we introduce NovBench, the first large-scale benchmark designed to evaluate LLMs' capability to generate novelty evaluations in support of human peer review. NovBench comprises 1,684 paper-review pairs from a leading NLP conference, including novelty descriptions extracted from paper introductions and corresponding expert-written novelty evaluations. We focus on both sources because the introduction provides a standardized and explicit…
