Aligning Programming Language and Natural Language: Exploring Design Choices in Multi-Modal Transformer-Based Embedding for Bug Localization

Partha Chakraborty; Venkatraman Arumugam; Meiyappan Nagappan

arXiv:2406.17615·cs.SE·June 26, 2024

TL;DR

This paper investigates how different design choices in multi-modal transformer-based embeddings affect bug localization performance, highlighting the importance of pre-training and data familiarity in cross-project scenarios.

Contribution

It systematically evaluates 14 embedding models and analyzes how design decisions impact bug localization accuracy, providing insights into optimal embedding strategies.

Findings

01 Pre-training strategies significantly influence embedding quality.

02 Data familiarity impacts bug localization performance.

03 Cross-project bug localization performance varies greatly.

Abstract

Bug localization is the task of identifying the source code files, written in a programming language, that are responsible for unexpected software behavior, using a bug report written in natural language. Because bug localization is labor-intensive, bug localization models are employed to assist software developers. Owing to the domain difference between source code files and bug reports, modern bug localization systems based on deep learning rely heavily on embedding techniques that project bug reports and source code files into a shared vector space. Creating an embedding involves several design choices, but the impact of these choices on embedding quality and on the performance of bug localization models remains unexplained in current research. To address this gap, our study evaluated 14 distinct embedding models to gain insights into the effects of various…
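
The shared vector space described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes that some embedding model has already mapped a bug report and each source file to fixed-length vectors, and it ranks files by cosine similarity to the report. The file paths, the three-dimensional vectors, and the function names `cosine_similarity` and `rank_files` are all hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_files(report_vec, file_vecs):
    """Return (path, score) pairs sorted from most to least similar."""
    scored = [
        (path, cosine_similarity(report_vec, vec))
        for path, vec in file_vecs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy embeddings; a real system would obtain these
# from a trained multi-modal embedding model.
report_embedding = [0.9, 0.1, 0.0]
file_embeddings = {
    "src/Parser.java": [0.8, 0.2, 0.1],
    "src/Logger.java": [0.0, 0.1, 0.9],
    "src/Cache.java": [0.3, 0.9, 0.1],
}

ranking = rank_files(report_embedding, file_embeddings)
for path, score in ranking:
    print(f"{path}: {score:.3f}")
```

In this toy setup the file whose vector points in nearly the same direction as the report vector is ranked first. Cosine similarity is only one possible scoring function and is used here as an assumption.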

Code & Models

Repositories

https://zenodo.org/record/10519746
Official
