CSR-Bench: Benchmarking LLM Agents in Deployment of Computer Science Research Repositories

Yijia Xiao; Runhui Wang; Luyang Kong; Davor Golac; Wei Wang

arXiv:2502.06111 · cs.SE · February 13, 2025

TL;DR

This paper introduces CSR-Bench, a benchmark for evaluating LLMs in deploying computer science research repositories, and presents CSR-Agents, a framework for automating repository deployment to improve research workflow efficiency.

Contribution

The paper presents CSR-Bench for assessing LLMs in research deployment and introduces CSR-Agents, a novel multi-agent framework for automating code repository deployment.

Findings

01. LLM agents can automate repository deployment tasks effectively.

02. Preliminary results show increased productivity in research workflows.

03. CSR-Bench provides a comprehensive evaluation of LLM capabilities in research settings.

Abstract

The increasing complexity of computer science research projects demands more effective tools for deploying code repositories. Large Language Models (LLMs), such as Anthropic Claude and Meta Llama, have demonstrated significant advancements across various fields of computer science research, including the automation of diverse software engineering tasks. To evaluate the effectiveness of LLMs in handling complex code development tasks of research projects, particularly for NLP/CV/AI/ML/DM topics, we introduce CSR-Bench, a benchmark for Computer Science Research projects. This benchmark assesses LLMs from various aspects including accuracy, efficiency, and deployment script quality, aiming to explore their potential in conducting computer science research autonomously. We also introduce a novel framework, CSR-Agents, that utilizes multiple LLM agents to automate the deployment of GitHub…

Code & Models

Datasets

ai-coscientist/researcher-ablation-bench
dataset · 22 dl

Videos

CSR-Bench: Benchmarking LLM Agents in Deployment of Computer Science Research Repositories

Taxonomy

Topics: Research Data Management Practices · Digital Rights Management and Security · Scientific Computing and Data Management

Methods: Sparse Evolutionary Training