LLMs as Evaluators: A Novel Approach to Evaluate Bug Report Summarization

Abhishek Kumar; Sonia Haiduc; Partha Pratim Das; Partha Pratim Chakrabarti

arXiv:2409.00630·cs.SE·September 4, 2024

Open Access

TL;DR

This paper explores the use of Large Language Models as automated evaluators for bug report summarization, demonstrating their potential to match human judgment and reduce evaluation fatigue.

Contribution

It introduces a novel approach that uses LLMs to evaluate bug report summaries and compares their performance with that of human evaluators in a controlled experiment.

Findings

01 · LLMs performed well in evaluating bug report summaries.
02 · GPT-4o outperformed the other LLMs in accuracy.
03 · LLMs showed consistent decision-making, similar to that of humans.
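
The page does not say how consistency was measured. One common way to compare two evaluators' decisions on the same items is Cohen's kappa, which is agreement corrected for chance. The sketch below is an illustrative assumption, not the authors' metric. It assumes each evaluator's decisions are encoded per item (here, 1 = picked the correct option, 0 = did not). The function name `cohens_kappa` and the toy data are hypothetical.

```python
from collections import Counter


def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Chance-corrected agreement between two evaluators' labels on the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(a)
    # Observed agreement: fraction of items where both evaluators gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if each evaluator labeled independently at its own base rates.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if expected == 1.0:
        return 1.0  # convention: both evaluators always used the same single label
    return (observed - expected) / (1 - expected)


# Toy data (invented): 1 = evaluator picked the correct option on that item.
human = [1, 1, 0, 1, 1]
llm = [1, 1, 0, 0, 1]
print(round(cohens_kappa(human, llm), 4))  # 0.5455
```

Kappa is used here rather than raw percent agreement because two evaluators who both pick the correct option most of the time would agree often by chance alone.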

Abstract

Summarizing software artifacts is an important task that has been thoroughly researched. For evaluating software summarization approaches, human judgment is still the most trusted form of evaluation. However, it is time-consuming and fatiguing for evaluators, making it challenging to scale and reproduce. Large Language Models (LLMs) have demonstrated remarkable capabilities in various software engineering tasks, motivating us to explore their potential as automatic evaluators for approaches that aim to summarize software artifacts. In this study, we investigate whether LLMs can evaluate bug report summarization effectively. We conducted an experiment in which we presented the same set of bug summarization problems to humans and three LLMs (GPT-4o, LLaMA-3, and Gemini) for evaluation on two tasks: selecting the correct bug report title and bug report summary from a set of options. Our results…
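
Both tasks in the abstract ask the evaluator to select the correct option from a set, so each evaluator can be scored as multiple-choice accuracy per task. The sketch below is a hypothetical illustration, not the authors' code. It assumes each problem stores its candidate options and the index of the correct one. The `Problem` class, the `"title"`/`"summary"` task labels, the evaluator names, and all data are made up.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    task: str           # hypothetical label: "title" or "summary"
    options: list[str]  # candidate titles or summaries shown to the evaluator
    correct: int        # index of the option that truly belongs to the bug report


def accuracy(problems: list[Problem], choices: list[int], task: str) -> float:
    """Fraction of problems of one task where the evaluator picked the correct option."""
    scored = [(p, c) for p, c in zip(problems, choices) if p.task == task]
    if not scored:
        raise ValueError(f"no problems for task {task!r}")
    return sum(c == p.correct for p, c in scored) / len(scored)


# Toy data (invented): three problems and each evaluator's chosen option index per problem.
problems = [
    Problem("title", ["Crash on save", "Slow startup", "Typo in menu"], 0),
    Problem("title", ["Login fails", "Icons missing"], 1),
    Problem("summary", ["Summary A", "Summary B", "Summary C"], 2),
]
choices = {"human_1": [0, 0, 2], "llm_a": [0, 1, 2]}

for name, picks in choices.items():
    print(name, accuracy(problems, picks, "title"), accuracy(problems, picks, "summary"))
# human_1 0.5 1.0
# llm_a 1.0 1.0
```

The same problems and the same scoring apply to human and LLM evaluators alike, which makes their accuracies directly comparable.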

Taxonomy

Topics: Software Engineering Research · Advanced Malware Detection Techniques · Software Testing and Debugging Techniques