LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research

Shuo Yan; Ruochen Li; Ziming Luo; Zimu Wang; Daoyang Li; Liqiang Jing; Kaiyu He; Peilin Wu; George Michalopoulos; Yue Zhang; Ziyang Zhang; Mian Zhang; Zhiyu Chen; Xinya Du

arXiv:2506.17335 · cs.SE · June 24, 2025

TL;DR

This paper introduces LMR-BENCH, a benchmark of 28 code reproduction tasks drawn from 23 NLP research papers, for evaluating how well LLM agents reproduce code from language modeling research. Results reveal significant limitations in current models.

Contribution

The paper presents LMR-BENCH, the first benchmark specifically designed to assess LLMs' performance in reproducing research code from scientific publications.

Findings

01. State-of-the-art LLMs show limited accuracy in code reproduction

02. Models struggle with complex scientific reasoning tasks

03. Significant gaps remain in LLMs' ability to autonomously reproduce research code

Abstract

Large language model (LLM) agents have demonstrated remarkable potential in advancing scientific discovery. However, their capability in the fundamental yet crucial task of reproducing code from research papers, especially in the NLP domain, remains underexplored. This task includes unique complex reasoning challenges in the intellectual synthesis of abstract concepts and the comprehension of code repositories with interdependent files. Motivated by this gap, we present LMR-BENCH, a benchmark designed to systematically evaluate the capability of LLM agents on code reproduction from Language Modeling Research. It consists of 28 code reproduction tasks derived from 23 research papers published in top-tier NLP venues over the past five years, spanning nine fundamental categories. Models are provided with a research paper, a code repository containing one or more masked functions, and…

Code & Models

Repositories

du-nlp-lab/lmr-bench (Official)

Videos

LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research