Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross-Language Code Clone Detection

Mohamad Khajezade, Fatemeh H. Fard, and Mohamed Sami Shehata

arXiv:2605.02860·cs.AI·May 6, 2026

TL;DR

This paper introduces a knowledge distillation framework with response stabilization techniques to enhance the reliability and performance of compact open-source models for cross-language code clone detection, addressing limitations of large language models.

Contribution

The paper proposes a distillation and stabilization approach that improves the reasoning ability, reliability, and efficiency of small models for cross-language code clone detection.

Findings

01. Distillation improves model reliability across multiple language pairs.

02. Response stabilization methods increase inference speed and consistency.

03. Models perform well even under distribution shift.

Abstract

Cross-language code clone detection (X-CCD) is challenging because semantically equivalent programs written in different languages often share little surface similarity. Although large language models (LLMs) have shown promise for semantic clone detection, their use as black-box systems raises concerns about cost, reproducibility, privacy, and unreliable output formatting. In particular, compact open-source models often struggle to follow reasoning-oriented prompts and to produce outputs that can be consistently mapped to binary clone labels. To address these limitations, we propose a knowledge distillation framework that transfers reasoning capabilities from DeepSeek-R1 into compact open-source student models for X-CCD. Using cross-language code pairs derived from Project CodeNet, we construct reasoning-oriented synthetic training data and fine-tune Phi3 and Qwen-Coder with LoRA…
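
The abstract notes that compact models often fail to produce outputs that map consistently to binary clone labels. As a rough illustration of that mapping step only, the sketch below reads a free-form model response and extracts a label. The function name `map_response_to_label` and the rule "take the last standalone yes/no in the response" are assumptions made for this example; they are not the paper's stabilization method, which the truncated abstract does not describe.

```python
import re
from typing import Optional

# Hypothetical verdict pattern: a standalone "yes" or "no", case-insensitive.
# The source does not specify the output format the models are prompted for.
_VERDICT = re.compile(r"\b(yes|no)\b", re.IGNORECASE)


def map_response_to_label(response: str) -> Optional[int]:
    """Map a free-form response to 1 (clone), 0 (not a clone), or None.

    Assumption for this sketch: the final yes/no in the text is the verdict,
    since reasoning-style responses tend to conclude with the answer.
    """
    matches = _VERDICT.findall(response)
    if not matches:
        # No usable verdict: the response cannot be mapped to a label.
        return None
    return 1 if matches[-1].lower() == "yes" else 0


if __name__ == "__main__":
    examples = [
        "Both programs sum the elements of a list. Answer: Yes",
        "There is no shared logic between them. Final answer: no",
        "I am unsure whether these are equivalent.",
    ]
    for text in examples:
        print(map_response_to_label(text), "<-", text)
```

Returning `None` for unmappable responses keeps formatting failures visible instead of silently counting them as one class, which matters when output reliability is itself being measured.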
