SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?
John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida I. Wang, Ofir Press

TL;DR
This paper introduces SWE-bench Multimodal, a new benchmark for evaluating AI systems' ability to fix bugs in visual, JavaScript-based software, revealing current limitations and demonstrating the effectiveness of a language-agnostic approach.
Contribution
The paper presents SWE-bench Multimodal, a novel benchmark with visual tasks in JavaScript, and shows that a language-agnostic system outperforms existing models on this new domain.
Findings
Top systems struggle with visual, cross-language tasks.
SWE-agent resolves twice as many tasks as the next best system.
Visual and cross-language generalization remains a challenge for current AI systems.
Abstract
Autonomous systems for software engineering are now capable of fixing bugs and developing features. These systems are commonly evaluated on SWE-bench (Jimenez et al., 2024a), which assesses their ability to solve software issues from GitHub repositories. However, SWE-bench uses only Python repositories, with problem statements presented predominantly as text and lacking visual elements such as images. This limited coverage motivates our inquiry into how existing systems might perform on unrepresented software engineering domains (e.g., front-end, game development, DevOps), which use different programming languages and paradigms. Therefore, we propose SWE-bench Multimodal (SWE-bench M) to evaluate systems on their ability to fix bugs in visual, user-facing JavaScript software. SWE-bench M features 617 task instances collected from 17 JavaScript libraries used for web interface design,…
Taxonomy
Topics: Software Engineering Research · Software Engineering Techniques and Practices · Semantic Web and Ontologies
