Improving the Ability of Pre-trained Language Model by Imparting Large Language Model's Experience

Xin Yin; Chao Ni; Xiaodan Xu; Xinrui Li; Xiaohu Yang

arXiv:2408.08553 · cs.SE · January 16, 2025

TL;DR

This paper proposes leveraging large language models to generate domain-specific data, significantly improving the performance of pre-trained language models on software engineering tasks like fault localization and clone detection.

Contribution

The paper introduces an approach that uses LLMs to generate training data, which is then used to enhance pre-trained LMs on non-generative software engineering tasks.
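The general idea of augmenting a fine-tuning set with LLM-generated examples can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the summary here does not describe their prompts, model, or filtering, so `Example`, `stub_generator`, and `augment` are hypothetical names, and the generator is a stand-in that makes no real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Example:
    code: str   # source snippet
    label: int  # task label, e.g. 1 = faulty, 0 = clean (assumed encoding)


def stub_generator(seed: Example, n: int) -> List[Example]:
    """Hypothetical stand-in for an LLM call: returns n varied copies of the seed."""
    return [
        Example(code=f"{seed.code}  # variant {i}", label=seed.label)
        for i in range(n)
    ]


def augment(
    train: List[Example],
    generate: Callable[[Example, int], List[Example]],
    per_seed: int = 2,
) -> List[Example]:
    """Return the original data plus generated examples, dropping empties and duplicates."""
    seen = {ex.code for ex in train}
    augmented = list(train)
    for seed in train:
        for candidate in generate(seed, per_seed):
            # Keep only non-empty snippets that are not already in the set.
            if candidate.code.strip() and candidate.code not in seen:
                seen.add(candidate.code)
                augmented.append(candidate)
    return augmented
```

In practice `generate` would wrap a real LLM request, and the augmented list would be passed to whatever fine-tuning routine the pre-trained LM uses; both are outside the scope of this sketch.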

Findings

01. LLM-generated data significantly improves model performance.
02. Up to 58.36% improvement in fault localization.
03. Up to 6.09% improvement in clone detection.

Abstract

Large Language Models (LLMs) and pre-trained Language Models (LMs) have achieved impressive success on many software engineering tasks (e.g., code completion and code generation). By leveraging huge existing code corpora (e.g., GitHub), these models can understand the patterns in source code and use these patterns to predict code properties. However, LLMs under few-shot learning perform poorly on non-generative tasks (e.g., fault localization and vulnerability localization), and fine-tuning LLMs is time-consuming and costly for end users and small organizations. Furthermore, the performance of fine-tuning LMs for non-generative tasks is impressive, yet it heavily depends on the amount and quality of data. As a result, the current lack of data and the high cost of collecting it in real-world scenarios further limit the applicability of LMs. In this paper, we leverage the powerful…

Taxonomy

Topics: Topic Modeling