OmniCode: A Benchmark for Evaluating Software Engineering Agents

Atharv Sonwane; Eng-Shen Tu; Wei-Chung Lu; Claas Beger; Carter Larsen; Debjit Dhar; Simon Alford; Rachel Chen; Ronit Pattanayak; Tuan Anh Dang; Guohao Chen; Gloria Geng; Kevin Ellis; Saikat Dutta

arXiv:2602.02262·cs.SE·May 19, 2026

TL;DR

OmniCode is a benchmark of 1794 software engineering tasks spanning three languages (Python, Java, and C++) and four task categories, designed to evaluate LLM-powered coding agents on more than narrowly scoped code or patch generation.

Contribution

The paper introduces a broad, validated, and synthetically generated set of software engineering tasks for evaluating coding agents, addressing the narrow task scope of prior benchmarks such as HumanEval and SWE-Bench.

Findings

1. SWE-Agent performs well on Python bug fixing but struggles on the other tasks.
2. SWE-Agent reaches a maximum of 25.0% accuracy on C++ test generation.
3. OmniCode reveals gaps in current agent capabilities across languages and task types.

Abstract

LLM-powered coding agents are redefining how real-world software is developed. To drive the research towards better coding agents, we require challenging benchmarks that can rigorously evaluate the ability of such agents to perform various software engineering tasks. However, popular coding benchmarks such as HumanEval and SWE-Bench focus on narrowly scoped tasks such as competition programming and patch generation. In reality, software engineers have to handle a broader set of tasks for real-world software development. To address this gap, we propose OmniCode, a novel software engineering benchmark that contains a broader and more diverse set of task categories beyond code or patch generation. Overall, OmniCode contains 1794 tasks spanning three programming languages - Python, Java, and C++ - and four key categories: bug fixing, test generation, code review fixing, and style fixing. In…

Code & Models

Repositories

seal-research/OmniCode (GitHub)

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software Engineering Techniques and Practices