SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?

John Yang; Carlos E. Jimenez; Alex L. Zhang; Kilian Lieret; Joyce Yang; Xindi Wu; Ori Press; Niklas Muennighoff; Gabriel Synnaeve; Karthik R. Narasimhan; Diyi Yang; Sida I. Wang; Ofir Press

arXiv:2410.03859 · cs.CL · October 8, 2024

TL;DR

This paper introduces SWE-bench Multimodal, a new benchmark for evaluating AI systems' ability to fix bugs in visual, JavaScript-based software, revealing current limitations and demonstrating the effectiveness of a language-agnostic approach.

Contribution

The paper presents SWE-bench Multimodal, a novel benchmark with visual tasks in JavaScript, and shows that a language-agnostic system outperforms existing models on this new domain.

Findings

01

Top systems struggle with visual, cross-language tasks.

02

SWE-agent resolves twice as many tasks as the next best system.

03

Visual and cross-language generalization remains a challenge for current AI systems.

Abstract

Autonomous systems for software engineering are now capable of fixing bugs and developing features. These systems are commonly evaluated on SWE-bench (Jimenez et al., 2024a), which assesses their ability to solve software issues from GitHub repositories. However, SWE-bench uses only Python repositories, with problem statements presented predominantly as text and lacking visual elements such as images. This limited coverage motivates our inquiry into how existing systems might perform on unrepresented software engineering domains (e.g., front-end, game development, DevOps), which use different programming languages and paradigms. Therefore, we propose SWE-bench Multimodal (SWE-bench M), to evaluate systems on their ability to fix bugs in visual, user-facing JavaScript software. SWE-bench M features 617 task instances collected from 17 JavaScript libraries used for web interface design,…

Code & Models

Datasets

hrtxsny/SWE-bench-plus
dataset · 83 downloads

Taxonomy

Topics: Software Engineering Research · Software Engineering Techniques and Practices · Semantic Web and Ontologies