CONGRA: Benchmarking Automatic Conflict Resolution
Qingyu Zhang, Liangcai Su, Kai Ye, Chenxiong Qian

TL;DR
ConGra introduces a large-scale, conflict-graded benchmark dataset for evaluating large language models' effectiveness in automatic software conflict resolution across varying complexity levels.
Contribution
This paper presents a novel conflict classification scheme and a dataset of 44,948 conflicts from 34 real-world projects in C, C++, Java, and Python for benchmarking LLMs on software merging tasks.
Findings
LLMs' performance varies significantly with conflict complexity.
In the paper's evaluation, general-purpose LLMs outperform specialized code LLMs.
Benchmarking uncovers limitations of current LLMs in conflict resolution.
Abstract
Resolving conflicts that arise from merging different software versions is a challenging task. To reduce the overhead of manual merging, researchers have developed various program analysis-based tools, which solve only specific types of conflicts and have a limited scope of application. With the development of language models, researchers treat conflicting code as text, which theoretically allows for addressing almost all types of conflicts. However, the absence of effective conflict difficulty grading methods hinders a comprehensive evaluation of large language models (LLMs), making it difficult to gain a deeper understanding of their limitations. Furthermore, there is a notable lack of large-scale open benchmarks for evaluating the performance of LLMs in automatic conflict resolution. To address these issues, we introduce ConGra, a CONflict-GRAded benchmarking scheme designed to evaluate the performance of…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- Impressive merge conflict data collection established
- Paper establishes that LLMs with chain of thought can do a decent job in resolving merge conflicts
In Section 3.1, I was confused about how merges are detected. It turns out this is explained on page 6, in the first paragraph of Section 4. I think this belongs in Section 3, as it is independent of the nine systems chosen. In Section 5, I was disappointed that comparison with existing merging tools was considered infeasible. While I can follow the reasoning to some extent, this undermines the whole purpose of having a benchmark. In particular, the abstract states:
1. The construction of a substantial dataset containing 44,948 conflict cases from 34 real-world projects across multiple programming languages (C, C++, Java, and Python) is a valuable contribution to the field. Using real-world conflict cases from open-source projects is appropriate and lends credibility to the evaluation.
2. It introduces a new method for classifying code merge conflicts based on code operations extracted from syntax trees, enabling a complexity-graded dataset.
3. It provides
1. The paper claims generalist LLMs outperform specialized code LLMs in automatic conflict resolution tasks. However, the code-focused models evaluated are relatively older and potentially less capable than the general-purpose LLMs used. Including recent code-specific LLMs would strengthen the authors' claims and provide a fairer basis for comparison.
2. The authors' explanations for their findings lack sufficient depth and appear speculative. For example, the claim that long-context models unde
+ A new dataset for evaluating ACR tools
+ Performance and limitations of existing LLMs on the new data were explored
- motivation for labeling conflicts with complexity scenario info is unclear
- dataset construction lacks transparency about the criteria used for project selection
- complexity type info (e.g., Text, Functional, Syntax) is too coarse-grained to be practical
- missing benchmarking of existing machine learning-based ACR tools (e.g., DeepMerge, MergeGen)
- lack of deep analysis regarding the pros and cons of the examined LLMs
- focused on limited syntax problems (i.e., declarations or definitions only)
1) The paper tackles an important software engineering problem -- automated merge conflict resolution
2) The effort promotes open science by promising to open source the benchmark, which could help drive research in automated program merge
The benchmark matching criteria are not optimal and are not well aligned with the evaluation metrics used in existing merge conflict research. They appear to be better suited to a typical code in-filling task. More specifically, CONGRA regards a resolution candidate as matching the ground truth when their similarity is greater than 80%. In my opinion, this is a major issue, and I see two possible solutions: 1) require a stricter 100% syntactic match (modulo whitespace and indentation). A
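The two matching criteria contrasted in this review can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: the excerpt does not say which similarity measure CONGRA uses, so `difflib.SequenceMatcher` stands in for it as an assumption, and the function names (`normalize`, `similarity`, `matches_threshold`, `matches_exact`) are hypothetical.

```python
import difflib


def normalize(code: str) -> str:
    """Collapse whitespace runs inside each line and drop blank lines.

    Assumption: this is one plausible reading of "modulo whitespace and
    indentation". It also collapses whitespace inside string literals.
    """
    lines = (" ".join(line.split()) for line in code.splitlines())
    return "\n".join(line for line in lines if line)


def similarity(candidate: str, truth: str) -> float:
    """Character-level similarity in [0, 1] (stand-in metric, an assumption)."""
    matcher = difflib.SequenceMatcher(None, normalize(candidate), normalize(truth))
    return matcher.ratio()


def matches_threshold(candidate: str, truth: str, threshold: float = 0.8) -> bool:
    """Lenient criterion: accept when similarity exceeds the threshold."""
    return similarity(candidate, truth) > threshold


def matches_exact(candidate: str, truth: str) -> bool:
    """Strict criterion: accept only an exact match after normalization."""
    return normalize(candidate) == normalize(truth)


if __name__ == "__main__":
    truth = "if (x > 0) {\n    return x + 1;\n}"
    # Same shape as the ground truth, but '+' became '-': a behavioral change.
    candidate = "if (x > 0) {\n  return x - 1;\n}"
    print("threshold match:", matches_threshold(candidate, truth))
    print("exact match:", matches_exact(candidate, truth))
```

In this example the candidate differs from the ground truth by a single operator, so it clears the 80% threshold yet fails the exact-match check, which is the kind of semantically wrong resolution a lenient criterion can accept.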
+: This paper first investigates the performance of LLMs in resolving software version conflicts.
+: Proposed a conflict dataset in multiple programming languages.
-: The authors adopt only a few models for evaluation, which makes their conclusions less convincing. For example, it would be better to include results for more proprietary models, such as GPT-4o.
-: The authors state that GPT-4o-mini is evaluated only on the Java subset due to its token limitation, but the context window of GPT-4o-mini is 128K, which is comparable to that of other models in the evaluation.
-: The finding "general-purpose LLMs outperform code LLMs" may be inaccurate. The code LLMs in this
