OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution
Lianghong Guo, Wei Tao, Runhan Jiang, Yanlin Wang, Jiachi Chen, Xilin Liu, Yuchi Ma, Mingzhi Mao, Hongyu Zhang, Zibin Zheng

TL;DR
OmniGIRL is a benchmark for GitHub issue resolution that evaluates the multilingual, multimodal, and multi-domain capabilities of large language models. It shows that current models perform poorly, especially on issues that include multimodal data such as images.
Contribution
The paper introduces OmniGIRL, a novel benchmark covering multiple languages, domains, and modalities, addressing limitations of existing single-language, text-only benchmarks.
Findings
Current LLMs perform poorly on OmniGIRL, with GPT-4o resolving only 8.6% of issues.
Models struggle with issues that require image understanding: the best model resolves only 10.5% of them.
Analysis provides insights into why LLMs fail, guiding future research.
Abstract
The GitHub issue resolution task aims to resolve issues reported in repositories automatically. With advances in large language models (LLMs), this task has gained increasing attention, and several benchmarks are proposed to evaluate the issue resolution ability of LLMs. However, existing benchmarks have three main limitations. First, current benchmarks focus on a single programming language, limiting the evaluation of issues from repositories across different languages. Second, they usually cover a narrow range of domains, which may fail to represent the diversity of real-world issues. Third, existing benchmarks rely solely on textual information in issue descriptions, overlooking multimodal information such as images in issues. In this paper, we propose OmniGIRL, a GitHub Issue ResoLution benchmark that is multilingual, multimodal, and multi-domain. OmniGIRL includes 959 task…
Taxonomy
Topics: Software Engineering Research · Scientific Computing and Data Management · Topic Modeling
