The Struggles of LLMs in Cross-lingual Code Clone Detection

Micheline Bénédicte Moumoula; Abdoul Kader Kabore; Jacques Klein; Tegawendé Bissyande

arXiv:2408.04430·cs.SE·May 7, 2025

TL;DR

This paper evaluates the effectiveness of Large Language Models and embedding models in cross-lingual code clone detection, revealing that embedding models outperform LLMs in accuracy, especially on complex code examples.

Contribution

It provides a comprehensive comparison of LLMs and embedding models for cross-lingual code clone detection, highlighting the superior performance of embedding-based representations.

Findings

01. LLMs achieve high F1 scores (~0.99) on simple code examples.

02. Embedding models outperform LLMs by 1-20 percentage points on benchmark datasets.

03. Embedding representations enable more accurate classification of code clones across languages.
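For readers unfamiliar with the metric quoted in the findings, the F1 score is the harmonic mean of precision and recall over the clone / non-clone predictions. The sketch below is a minimal illustration only: the function name `f1_score` and the toy labels are hypothetical and are not taken from the paper or its repository.

```python
def f1_score(y_true, y_pred):
    """Compute binary F1, treating 1 as 'clone' and 0 as 'non-clone'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # clones found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed clones
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical, for illustration only).
truth = [1, 1, 0, 0]
preds = [1, 0, 1, 0]
score = f1_score(truth, preds)  # precision 0.5, recall 0.5
```

Because F1 ignores true negatives, a detector cannot earn a high score just by labelling every pair as a non-clone.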

Abstract

With multiple programming languages involved in modern software development, cross-lingual code clone detection has gained traction within the software engineering community. Numerous studies have explored this topic, proposing various promising approaches. Inspired by the significant advances in machine learning in recent years, particularly Large Language Models (LLMs), which have demonstrated their ability to tackle various tasks, this paper revisits cross-lingual code clone detection. We evaluate the performance of five LLMs and eight prompts for the identification of cross-lingual code clones. Additionally, we compare these results against two baseline methods. Finally, we evaluate a pre-trained embedding model to assess the effectiveness of the generated representations for classifying clone and non-clone pairs. The studies involving LLMs and Embedding models…
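The abstract describes using embedding representations to separate clone from non-clone pairs. One simple way to do this can be sketched as follows, assuming each code fragment has already been mapped to a numeric vector. The cosine-similarity threshold used here is an assumption chosen for illustration, not the classifier the authors report, and the vectors, the function names, and the 0.8 cutoff are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # similarity is undefined for a zero vector; treat as 0
    return dot / (norm_u * norm_v)


def is_clone(emb_a, emb_b, threshold=0.8):
    """Label a pair as a clone when the embeddings are similar enough.

    The threshold is a hypothetical value; in practice it would be tuned
    on labelled pairs, or replaced by a trained classifier.
    """
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy embeddings (hypothetical): e.g. a Java snippet and a Python snippet.
java_vec = [0.9, 0.1, 0.3]
python_vec = [0.8, 0.2, 0.35]
unrelated_vec = [-0.2, 0.9, -0.4]

same_task = is_clone(java_vec, python_vec)
different_task = is_clone(java_vec, unrelated_vec)
```

Because the comparison operates on vectors rather than on source text, it does not depend on the two fragments sharing a programming language.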

Code & Models

Repositories

trux-dtf/clccd (Official)
