You Augment Me: Exploring ChatGPT-based Data Augmentation for Semantic Code Search
Yanlin Wang, Lianghong Guo, Ensheng Shi, Wenqing Chen, Jiachi Chen, Wanjun Zhong, Menghan Wang, Hui Li, Hongyu Zhang, Ziyu Lyu, Zibin Zheng

TL;DR
This paper introduces ChatDANCE, a novel data augmentation method using ChatGPT to generate high-quality code search data, significantly improving model performance and enabling the model to self-improve through an augment-filter-retrain strategy.
Contribution
The paper presents a new ChatGPT-based data augmentation approach for semantic code search, including a filtering mechanism and a self-growth strategy for the backbone model, achieving state-of-the-art results.
Findings
Achieves 13.2% improvement in R@1 over baseline.
Effective filtering enhances data quality and model performance.
Model learns more uniform and aligned code-query representations.
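For readers unfamiliar with the R@1 metric cited above, here is a minimal sketch of how Recall@k is commonly computed for code search. It assumes a square similarity matrix in which query i's correct snippet sits at index i; the function name `recall_at_k` and the toy scores are illustrative and are not taken from the paper.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose correct snippet ranks in the top k.

    Assumption: similarity[i][j] is the score between query i and
    snippet j, and snippet i is the correct answer for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate snippet indices by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores (made up for illustration): queries 0 and 2 rank their own
# snippet first, while query 1 ranks snippet 2 above its own snippet.
toy = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
r_at_1 = recall_at_k(toy, k=1)
r_at_2 = recall_at_k(toy, k=2)
```

On this toy matrix two of three queries are answered correctly at rank 1, and all three within the top 2.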
Abstract
Code search plays a crucial role in software development, enabling developers to retrieve and reuse code using natural language queries. While the performance of code search models improves with an increase in high-quality data, obtaining such data can be challenging and expensive. Recently, large language models (LLMs) such as ChatGPT have made remarkable progress in both natural and programming language understanding and generation, offering user-friendly interaction via simple prompts. Inspired by these advancements, we propose a novel approach ChatDANCE, which utilizes high-quality and diverse augmented data generated by a large language model and leverages a filtering mechanism to eliminate low-quality augmentations. Specifically, we first propose a set of ChatGPT prompting rules that are specifically designed for source code and queries. Then, we leverage ChatGPT to rewrite code…
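The augment-then-filter step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `rewrite_query` and `rewrite_code` stand in for ChatGPT calls, `score` stands in for whatever quality signal the filter uses, and the fixed `threshold` is hypothetical. None of these names or the thresholding rule comes from the paper, whose abstract is truncated here.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (natural-language query, code snippet)


def augment_and_filter(
    pairs: List[Pair],
    rewrite_query: Callable[[str], str],  # hypothetical stand-in for an LLM query rewrite
    rewrite_code: Callable[[str], str],   # hypothetical stand-in for an LLM code rewrite
    score: Callable[[str, str], float],   # hypothetical query-code quality score
    threshold: float,                     # assumed cutoff; not specified in the source
) -> List[Pair]:
    """Generate rewritten pairs and keep only those scoring at or above the threshold."""
    kept: List[Pair] = []
    for query, code in pairs:
        # Two augmentations per pair: a rewritten query with the original
        # code, and the original query with rewritten code.
        candidates = [(rewrite_query(query), code), (query, rewrite_code(code))]
        for cand_query, cand_code in candidates:
            # Filtering step: drop augmentations the scorer considers low quality.
            if score(cand_query, cand_code) >= threshold:
                kept.append((cand_query, cand_code))
    return kept
```

In an augment-filter-retrain loop, the retained pairs would be added to the original training data before the retrieval model is trained again; that training step is omitted from this sketch.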
Taxonomy
Topics: Topic Modeling · FinTech, Crowdfunding, Digital Finance · Recommender Systems and Techniques
