SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories
Ben Bogin, Kejuan Yang, Shashank Gupta, Kyle Richardson, Erin Bransom, Peter Clark, Ashish Sabharwal, Tushar Khot

TL;DR
SUPER is a benchmark designed to evaluate how well large language models can autonomously set up and execute research tasks from repositories, highlighting current limitations and guiding future improvements.
Contribution
This paper introduces SUPER, the first benchmark for assessing LLMs' ability to reproduce research results from repositories, including diverse problem sets and evaluation metrics.
Findings
GPT-4o solves only 16.3% of end-to-end problems
State-of-the-art models struggle with research repository tasks
SUPER provides a challenging resource for future progress
Abstract
Given that Large Language Models (LLMs) have made significant progress in writing code, can they now be used to autonomously reproduce results from research repositories? Such a capability would be a boon to the research community, helping researchers validate, understand, and extend prior work. To advance towards this goal, we introduce SUPER, the first benchmark designed to evaluate the capability of LLMs in setting up and executing tasks from research repositories. SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories. Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub problems derived from the expert set that focus on specific challenges (e.g., configuring a trainer), and 602 automatically generated problems for…
Taxonomy
Topics: Scientific Computing and Data Management · Semantic Web and Ontologies
Methods: Sparse Evolutionary Training · Focus
