ComSum: Commit Messages Summarization and Meaning Preservation
Leshem Choshen, Idan Amit

TL;DR
ComSum is a dataset of 7 million commit messages for text summarization. The authors propose evaluating generated summaries by meaning preservation in addition to Rouge.
Contribution
The paper introduces ComSum, a large-scale dataset for commit message summarization, and proposes evaluating summaries by meaning preservation in addition to Rouge.
Findings
Dataset contains 7 million commit messages.
Meaning preservation is proposed as an evaluation criterion alongside Rouge, based on the typology that commits follow.
The dataset benefits from the active field of empirical software engineering.
Abstract
We present ComSum, a data set of 7 million commit messages for text summarization. When documenting commits, software code changes, both a message and its summary are posted. We gather and filter those to curate developers' work summarization data set. Along with its growing size, practicality and challenging language domain, the data set benefits from the living field of empirical software engineering. As commits follow a typology, we propose to not only evaluate outputs by Rouge, but by their meaning preservation.
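The abstract's two ideas, pairing each commit message with its summary and checking meaning preservation alongside word overlap, can be illustrated with a minimal Python sketch. It assumes the common Git convention that the first line of a commit is the summary and the rest is the message body; the source does not spell out the exact split. The function names are hypothetical, `rouge1_f1` is a simplified unigram-overlap stand-in for Rouge, and `is_corrective` is a toy keyword check standing in for a commit-type classifier, not the authors' method.

```python
from collections import Counter


def split_commit(message: str) -> tuple[str, str]:
    """Split a raw commit message into (summary, body).

    Assumption: the first line is the summary (subject) and the
    remainder, after an optional blank line, is the message body.
    """
    subject, _, rest = message.strip().partition("\n")
    return subject.strip(), rest.strip()


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1: a simplified stand-in for Rouge-1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def is_corrective(text: str) -> bool:
    """Hypothetical toy classifier: flags bug-fix text by keyword."""
    keywords = ("fix", "bug", "error", "crash")
    lowered = text.lower()
    return any(k in lowered for k in keywords)


def meaning_preserved(body: str, generated_summary: str) -> bool:
    """True if the body and the summary receive the same commit-type label."""
    return is_corrective(body) == is_corrective(generated_summary)
```

A generated summary can share many words with the reference and still change the commit's type, for example by dropping the fact that the change was a bug fix. Comparing type labels is one way to catch that case, which word overlap alone would miss.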
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Natural Language Processing Techniques
