OmniCode: A Benchmark for Evaluating Software Engineering Agents
Atharv Sonwane, Eng-Shen Tu, Wei-Chung Lu, Claas Beger, Carter Larsen, Debjit Dhar, Simon Alford, Rachel Chen, Ronit Pattanayak, Tuan Anh Dang, Guohao Chen, Gloria Geng, Kevin Ellis, Saikat Dutta

TL;DR
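OmniCode is a benchmark of 1794 software engineering tasks spanning three languages (Python, Java, and C++) and four categories (bug fixing, test generation, code review fixing, and style fixing), designed to evaluate LLM-powered coding agents beyond narrowly scoped code and patch generation.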
Contribution
It introduces a broad, validated, and synthetically generated set of software tasks for evaluating engineering agents, addressing limitations of prior benchmarks.
Findings
SWE-Agent performs well on Python bug fixing but struggles on other tasks.
SWE-Agent achieves at most 25.0% accuracy on C++ test generation.
OmniCode reveals gaps in current agent capabilities across languages and task types.
Abstract
LLM-powered coding agents are redefining how real-world software is developed. To drive the research towards better coding agents, we require challenging benchmarks that can rigorously evaluate the ability of such agents to perform various software engineering tasks. However, popular coding benchmarks such as HumanEval and SWE-Bench focus on narrowly scoped tasks such as competition programming and patch generation. In reality, software engineers have to handle a broader set of tasks for real-world software development. To address this gap, we propose OmniCode, a novel software engineering benchmark that contains a broader and more diverse set of task categories beyond code or patch generation. Overall, OmniCode contains 1794 tasks spanning three programming languages - Python, Java, and C++ - and four key categories: bug fixing, test generation, code review fixing, and style fixing. In…
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software Engineering Techniques and Practices
