CoreGen: Contextualized Code Representation Learning for Commit Message Generation
Lun Yiu Nie, Cuiyun Gao, Zhicong Zhong, Wai Lam, Yang Liu, Zenglin Xu

TL;DR
CoreGen introduces a novel method for commit message generation that leverages contextualized code representations using Transformer models, significantly improving performance over existing static embedding approaches.
Contribution
The paper proposes a new contextualized code representation learning strategy for commit message generation, addressing the semantic gap between source code and natural language with a Transformer-based approach.
Findings
At least 28.18% improvement in BLEU-4 score over baselines
Effective utilization of contextual information in code representations
Potential for broader application in code-to-text tasks
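For readers unfamiliar with the BLEU-4 metric cited above, here is a minimal sketch of a sentence-level BLEU-4 computation: the geometric mean of clipped 1- to 4-gram precisions, multiplied by a brevity penalty. It assumes a single reference and no smoothing. The function names `ngrams` and `bleu4` are illustrative, and the exact BLEU variant used in the paper's evaluation may differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Unsmoothed sentence-level BLEU-4 against a single reference."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives zero
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / 4)


reference = "fix null pointer in parser".split()
exact = bleu4("fix null pointer in parser".split(), reference)
shorter = bleu4("fix null pointer in".split(), reference)
unrelated = bleu4("update readme file for release".split(), reference)
```

In this sketch an exact match scores 1.0, a truncated but otherwise correct message is reduced only by the brevity penalty, and a message with no shared n-grams scores 0.0.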
Abstract
Automatic generation of high-quality commit messages for code commits can substantially facilitate software developers' work and coordination. However, the semantic gap between source code and natural language poses a major challenge for the task. Several studies have been proposed to alleviate the challenge but none explicitly involves code contextual information during commit message generation. Specifically, existing research adopts static embedding for code tokens, which maps a token to the same vector regardless of its context. In this paper, we propose a novel Contextualized code representation learning strategy for commit message Generation (CoreGen). CoreGen first learns contextualized code representations which exploit the contextual information behind code commit sequences. The learned representations of code commits built upon Transformer are then fine-tuned for downstream…
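The distinction the abstract draws between static and contextualized token embeddings can be illustrated with a toy sketch. This is not CoreGen's architecture: the vocabulary, the 2-dimensional vectors, and the single self-attention layer with identity projections are all assumptions chosen to keep the example small. It shows only that a static lookup gives the token `x` the same vector in both lines, while one attention pass gives it a different vector once a neighboring token changes.

```python
import math

# Hypothetical toy vocabulary with hand-picked 2-d static embeddings.
STATIC = {
    "return": [1.0, 0.0],
    "x": [0.0, 1.0],
    "+": [1.0, 1.0],
    "-": [-1.0, 1.0],
    "1": [0.5, -0.5],
}


def static_embed(tokens):
    """Static embedding: each token maps to one fixed vector."""
    return [STATIC[t] for t in tokens]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention.

    Queries, keys, and values are the inputs themselves (identity
    projections), an assumption made purely for illustration.
    """
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)])
    return out


old_line = ["return", "x", "-", "1"]  # line before the commit
new_line = ["return", "x", "+", "1"]  # line after the commit

static_old, static_new = static_embed(old_line), static_embed(new_line)
ctx_old, ctx_new = self_attention(static_old), self_attention(static_new)

same_static = static_old[1] == static_new[1]  # vector for "x", static lookup
same_context = ctx_old[1] == ctx_new[1]       # vector for "x", after attention
```

Here `same_static` is true and `same_context` is false: the attended representation of `x` mixes in the changed operator, which is the kind of context sensitivity a static lookup cannot capture.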
Taxonomy
Topics: Software Engineering Research · Software Engineering Techniques and Practices · Software System Performance and Reliability
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Multi-Head Attention · Layer Normalization · Attention Is All You Need · Softmax · Label Smoothing · Byte Pair Encoding
