TL;DR
ReFEree is a reference-free, segment-level method for evaluating factual consistency in real-world code summaries. It correlates more closely with human judgment than previous approaches.
Contribution
The paper introduces a fine-grained, dependency-aware evaluation framework designed for multi-sentence code summaries, along with a new benchmark, and reports improved correlation with human assessments.
Findings
ReFEree achieves the highest correlation with human judgment among 13 baselines.
It improves on the previous state of the art by 15-18%.
The method effectively evaluates factual consistency at the segment level.
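To make "correlation with human judgment" concrete, here is a minimal sketch of how a metric's scores can be compared against human ratings using Spearman rank correlation. The choice of Spearman is an assumption (this page does not say which correlation coefficient the paper reports), and the scores and ratings below are invented purely for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: one metric score and one human rating per summary.
metric_scores = [0.9, 0.4, 0.75, 0.2, 0.6]
human_ratings = [5, 2, 4, 1, 3]
rho = spearman(metric_scores, human_ratings)
```

In this toy example the metric orders the five summaries exactly as the human raters do, so `rho` comes out at the maximum value of 1. A metric that ranked them in the opposite order would score -1.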
Abstract
As Large Language Models (LLMs) have become capable of generating long and descriptive code summaries, accurate and reliable evaluation of factual consistency has become a critical challenge. However, previous evaluation methods are primarily designed for short summaries of isolated code snippets. Consequently, they struggle to provide fine-grained evaluation of multi-sentence functionalities and fail to accurately assess dependency context commonly found in real-world code summaries. To address this, we propose ReFEree, a reference-free and fine-grained method for evaluating factual consistency in real-world code summaries. We define factual inconsistency criteria specific to code summaries and evaluate them at the segment level using these criteria along with dependency information. These segment-level results are then aggregated into a fine-grained score. We construct a code…
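The aggregation step described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the aggregation rule (the fraction of segments judged consistent), the `SegmentJudgment` type, and the example segments are all hypothetical, and the paper's actual criteria, dependency handling, and scoring formula are not given in the text above.

```python
from dataclasses import dataclass


@dataclass
class SegmentJudgment:
    """Hypothetical container for one segment-level verdict."""
    segment: str      # one segment of the code summary
    consistent: bool  # verdict: is the segment factually consistent with the code?


def aggregate_score(judgments):
    """Assumed aggregation: share of segments judged factually consistent."""
    if not judgments:
        raise ValueError("no segments to aggregate")
    return sum(j.consistent for j in judgments) / len(judgments)


# Invented verdicts for a four-segment summary.
judgments = [
    SegmentJudgment("Parses the configuration file.", True),
    SegmentJudgment("Calls a helper defined in another module.", True),
    SegmentJudgment("Returns a list of users.", False),
    SegmentJudgment("Raises ValueError on malformed input.", True),
]
score = aggregate_score(judgments)
```

With three of four segments judged consistent, `score` is 0.75. Scoring each segment separately is what makes the result fine-grained: a single wrong claim lowers the score without discarding the rest of the summary.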
