SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories

Ben Bogin; Kejuan Yang; Shashank Gupta; Kyle Richardson; Erin Bransom; Peter Clark; Ashish Sabharwal; Tushar Khot

arXiv:2409.07440 · cs.AI · September 12, 2024

Open Access · 1 Repo · 1 Dataset · 1 Video

TL;DR

SUPER is a benchmark that evaluates how well large language models can autonomously set up and execute tasks from research repositories, highlighting current limitations and guiding future improvements.

Contribution

This paper introduces SUPER, the first benchmark for assessing LLMs' ability to set up and execute tasks from research repositories. It comprises three distinct problem sets along with evaluation metrics.

Findings

01. GPT-4o solves only 16.3% of end-to-end problems

02. State-of-the-art models struggle with research repository tasks

03. SUPER provides a challenging resource for future progress

Abstract

Given that Large Language Models (LLMs) have made significant progress in writing code, can they now be used to autonomously reproduce results from research repositories? Such a capability would be a boon to the research community, helping researchers validate, understand, and extend prior work. To advance towards this goal, we introduce SUPER, the first benchmark designed to evaluate the capability of LLMs in setting up and executing tasks from research repositories. SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories. Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges (e.g., configuring a trainer), and 602 automatically generated problems for…

Code & Models

Repositories

allenai/super-benchmark
Official

Datasets

allenai/super
dataset · 388 downloads

Videos

SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories

Taxonomy

Topics: Scientific Computing and Data Management · Semantic Web and Ontologies

Methods: Sparse Evolutionary Training · Focus