An Effective Approach to Embedding Source Code by Combining Large Language and Sentence Embedding Models
Zixiang Xian, Chenhui Cui, Rubing Huang, Chunrong Fang, Zhenyu Chen

TL;DR
This paper introduces a method for embedding source code that combines large language models (LLMs) with sentence embedding models. It aims to remove the need for task-specific training or fine-tuning and to mitigate the erroneous information often found in LLM-generated outputs.
Contribution
It proposes a new unsupervised approach that integrates LLMs with sentence embedding models for source code representation, outperforming existing unsupervised methods without fine-tuning.
Findings
Outperforms state-of-the-art unsupervised methods
Effective across multiple programming languages
Reduces reliance on supervised training
Abstract
The advent of large language models (LLMs) has significantly advanced artificial intelligence (AI) in software engineering (SE), with source code embeddings playing a crucial role in tasks such as source code clone detection and source code clustering. However, existing methods for source code embedding, including those based on LLMs, often rely on costly supervised training or fine-tuning for domain adaptation. This paper proposes a novel approach to embedding source code by combining large language and sentence embedding models. This approach attempts to eliminate the need for task-specific training or fine-tuning and to effectively address the issue of erroneous information commonly found in LLM-generated outputs. To evaluate the performance of our proposed approach, we conducted a series of experiments on three datasets with different programming languages by considering various…
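The abstract does not spell out how the two models are wired together, so the following is only a minimal sketch of one plausible reading: an LLM first turns source code into text, and a sentence embedding model then turns that text into a vector that can be compared for tasks such as clone detection. Everything here is an assumption made for illustration. `describe_code` is a hypothetical offline stand-in for an LLM call, `embed_sentence` is a toy hashed bag-of-words stand-in for a real sentence embedding model, and `DIM`, `embed_code`, and `cosine` are illustrative names that do not come from the paper.

```python
import hashlib
import math
import re

DIM = 64  # size of the toy embedding vector (arbitrary, for illustration only)


def describe_code(source: str) -> str:
    """Hypothetical stand-in for an LLM that describes code in text.

    A real pipeline would send `source` to a language model. Here we only
    extract lowercase identifier-like tokens so the sketch runs offline.
    """
    tokens = re.findall(r"[A-Za-z_]+", source)
    return " ".join(t.lower() for t in tokens)


def embed_sentence(text: str) -> list:
    """Toy stand-in for a sentence embedding model.

    Builds a hashed bag-of-words vector and L2-normalizes it. An empty
    input yields the all-zero vector.
    """
    vec = [0.0] * DIM
    for word in text.split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def embed_code(source: str) -> list:
    """Two-stage embedding: text from the (stubbed) LLM, then a vector."""
    return embed_sentence(describe_code(source))


def cosine(a: list, b: list) -> float:
    """Cosine similarity for vectors that are already L2-normalized."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    snippet_a = "def add(a, b): return a + b"
    snippet_b = "def total(x, y): return x + y"
    print(cosine(embed_code(snippet_a), embed_code(snippet_b)))
```

Because neither stage is trained on a downstream task, the sketch mirrors the training-free setting the abstract describes. Swapping the two stubs for a real LLM and a real sentence embedding model would leave `embed_code` and `cosine` unchanged.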
Taxonomy
Topics: Speech Recognition and Synthesis
