An Effective Approach to Embedding Source Code by Combining Large Language and Sentence Embedding Models

Zixiang Xian; Chenhui Cui; Rubing Huang; Chunrong Fang; Zhenyu Chen

arXiv:2409.14644 · cs.SE · June 4, 2025

TL;DR

This paper introduces a method for embedding source code that combines large language models with sentence embedding models, aiming to eliminate task-specific training or fine-tuning and to mitigate the erroneous information often found in LLM-generated outputs.

Contribution

It proposes a new unsupervised approach that integrates LLMs with sentence embeddings for source code representation, outperforming existing methods without fine-tuning.

Findings

01. Outperforms state-of-the-art unsupervised methods

02. Effective across multiple programming languages

03. Reduces reliance on supervised training

Abstract

The advent of large language models (LLMs) has significantly advanced artificial intelligence (AI) in software engineering (SE), with source code embeddings playing a crucial role in tasks such as source code clone detection and source code clustering. However, existing methods for source code embedding, including those based on LLMs, often rely on costly supervised training or fine-tuning for domain adaptation. This paper proposes a novel approach to embedding source code by combining large language and sentence embedding models. This approach attempts to eliminate the need for task-specific training or fine-tuning and to effectively address the issue of erroneous information commonly found in LLM-generated outputs. To evaluate the performance of our proposed approach, we conducted a series of experiments on three datasets with different programming languages by considering various…
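
The abstract does not spell out how the two model types are combined, so the following is only a minimal sketch under one assumed reading: an LLM first turns the code into text, and a sentence embedding model then embeds that text. Both `describe_code_stub` and `embed_text_stub` are hypothetical stand-ins invented for illustration (a token extractor and a hashed bag-of-words vector); they are not the authors' models or any library's API.

```python
import math
import re
import zlib
from collections import Counter
from typing import Callable, List


def describe_code_stub(source: str) -> str:
    """Hypothetical stand-in for an LLM that describes code in text.

    It only extracts alphabetic tokens; a real LLM would be called here.
    """
    return " ".join(tok.lower() for tok in re.findall(r"[A-Za-z]+", source))


def embed_text_stub(text: str, dim: int = 64) -> List[float]:
    """Hypothetical stand-in for a sentence embedding model.

    Builds a hashed bag-of-words vector and L2-normalizes it.
    """
    vec = [0.0] * dim
    for word, count in Counter(text.split()).items():
        # crc32 gives a hash that is stable across Python processes.
        vec[zlib.crc32(word.encode("utf-8")) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def embed_code(
    source: str,
    describe: Callable[[str], str] = describe_code_stub,
    embed: Callable[[str], List[float]] = embed_text_stub,
) -> List[float]:
    """Assumed pipeline: code -> text description -> sentence embedding."""
    return embed(describe(source))


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0
```

In a clone detection or clustering setting, the vectors returned by `embed_code` would be compared with `cosine` or passed to a clustering routine; swapping the two stubs for a real LLM and a real sentence embedding model is left open because the abstract does not name them.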


Taxonomy

Topics: Speech Recognition and Synthesis