Dynamic Thresholding Mechanisms for IR-Based Filtering in Efficient Source Code Plagiarism Detection
Oscar Karnalim, Lisan Sulistiani

TL;DR
This paper introduces two dynamic thresholding mechanisms for IR-based filtering in source code plagiarism detection. Both select the similarity threshold adaptively from the distribution of similarity degrees, and both are reported to be more practical than manual threshold setting.
Contribution
It proposes range-based and pair-count-based dynamic thresholding mechanisms that adaptively tune similarity thresholds for more efficient and effective source code plagiarism detection.
Findings
Both mechanisms outperform manual threshold setting in efficiency and effectiveness.
They are more practical than manual threshold assignment because the threshold they select is more proportional to the resulting efficiency improvement.
Evaluation shows significant efficiency gains with minimal effectiveness loss.
Abstract
To address time inefficiency, string-matching-based source code plagiarism detection compares only potential pairs. Potentiality is defined through a fast yet order-insensitive similarity measurement (adapted from Information Retrieval), and only pairs whose similarity degrees are higher than or equal to a particular threshold are selected. Defining such a threshold is not a trivial task, since the threshold should lead to high efficiency improvement and low effectiveness reduction (if the reduction is unavoidable). This paper proposes two thresholding mechanisms, namely the range-based and pair-count-based mechanisms, that dynamically tune the threshold based on the distribution of the resulting similarity degrees. According to our evaluation, both mechanisms are more practical to use than manual threshold assignment since they are more proportional to efficiency improvement and…
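The abstract does not give the formulas behind either mechanism, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes cosine similarity over token frequencies as the order-insensitive IR measure, and a hypothetical `ratio` parameter in [0, 1] that positions the threshold either within the observed similarity range (range-based) or at a rank in the sorted list of pairs (pair-count-based). All function names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(tokens_a, tokens_b):
    """Order-insensitive similarity over token frequency vectors (assumed measure)."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarities(submissions):
    """Map each unordered pair of submission ids to its similarity degree."""
    return {
        (a, b): cosine_similarity(submissions[a], submissions[b])
        for a, b in combinations(sorted(submissions), 2)
    }


def range_based_threshold(similarities, ratio):
    """Assumed reading: place the threshold at `ratio` of the observed range."""
    lowest, highest = min(similarities), max(similarities)
    return lowest + ratio * (highest - lowest)


def pair_count_based_threshold(similarities, ratio):
    """Assumed reading: pick the threshold that keeps the top `ratio` of pairs."""
    ranked = sorted(similarities, reverse=True)
    keep = max(1, math.ceil(ratio * len(ranked)))
    return ranked[keep - 1]


def filter_pairs(pair_similarities, threshold):
    """Keep only pairs whose similarity is higher than or equal to the threshold."""
    return [pair for pair, s in pair_similarities.items() if s >= threshold]
```

Only the pairs returned by `filter_pairs` would then be passed to the slower, order-sensitive string-matching comparison. Under this reading, both thresholds adapt to each set of submissions: the range-based variant follows the spread of the similarity degrees, while the pair-count-based variant directly bounds how many pairs reach the expensive stage.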
Taxonomy
Topics: Topic Modeling · Text and Document Classification Technologies · Spam and Phishing Detection
