Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross-Language Code Clone Detection
Mohamad Khajezade, Fatemeh H. Fard, and Mohamed Sami Shehata

TL;DR
This paper introduces a knowledge distillation framework with response stabilization techniques that improves the reliability and performance of compact open-source models for cross-language code clone detection, addressing the cost, reproducibility, privacy, and output-formatting concerns of black-box large language models.
Contribution
The paper proposes a distillation and stabilization approach that improves the reasoning ability, reliability, and efficiency of small models for cross-language code clone detection.
Findings
Distillation improves model reliability across multiple language pairs.
Response stabilization methods increase inference speed and consistency.
Models perform well even under distribution shift.
Abstract
Cross-language code clone detection (X-CCD) is challenging because semantically equivalent programs written in different languages often share little surface similarity. Although large language models (LLMs) have shown promise for semantic clone detection, their use as black-box systems raises concerns about cost, reproducibility, privacy, and unreliable output formatting. In particular, compact open-source models often struggle to follow reasoning-oriented prompts and to produce outputs that can be consistently mapped to binary clone labels. To address these limitations, we propose a knowledge distillation framework that transfers reasoning capabilities from DeepSeek-R1 into compact open-source student models for X-CCD. Using cross-language code pairs derived from Project CodeNet, we construct reasoning-oriented synthetic training data and fine-tune Phi3 and Qwen-Coder with LoRA…
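The abstract notes that compact models often fail to produce outputs that can be consistently mapped to binary clone labels. The following is a minimal sketch of what such a mapping step could look like, not the paper's stabilization method. The function name `parse_clone_label`, the `<think>...</think>` reasoning wrapper, and the yes/no answer vocabulary are all assumptions made for illustration.

```python
import re
from typing import Optional


def parse_clone_label(response: str) -> Optional[int]:
    """Map a free-form model response to 1 (clone), 0 (not clone), or None.

    Hypothetical helper: the output format assumed here (an optional
    <think>...</think> block followed by a yes/no verdict) is an
    illustration, not the format used in the paper.
    """
    # Drop the reasoning block so words inside it are not read as the verdict.
    answer = re.sub(r"<think>.*?</think>", " ", response,
                    flags=re.DOTALL | re.IGNORECASE)

    # Treat the last non-empty line as the final answer.
    lines = [ln.strip() for ln in answer.splitlines() if ln.strip()]
    if not lines:
        return None

    # Collect the distinct verdict words on that line.
    verdicts = set(re.findall(r"\b(yes|no)\b", lines[-1].lower()))

    # Zero verdicts or conflicting verdicts count as unparseable.
    if len(verdicts) != 1:
        return None
    return 1 if verdicts == {"yes"} else 0
```

Returning `None` for missing or conflicting verdicts keeps unparseable responses separate from negative predictions, so they can be counted as formatting failures rather than silently scored as "not a clone".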
