Towards a Measure of Algorithm Similarity
Shairoz Sohail, Taher Ali

TL;DR
This paper introduces EMOC, a framework for embedding algorithms into a feature space to measure their similarity, supporting tasks like clustering, duplicate detection, and diversity quantification.
Contribution
The paper presents EMOC, a novel embedding method for algorithms, and provides PACD, a curated dataset of Python implementations for evaluating similarity measures.
Findings
EMOC features enable effective clustering and classification of algorithms.
The framework detects near-duplicate algorithms and measures diversity.
Code and datasets are publicly released for reproducibility.
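As a rough illustration of the second finding, near-duplicate detection and diversity measurement can both be expressed as distance computations once each algorithm has a feature vector. The sketch below is a minimal example under stated assumptions, not the authors' released code: the vectors are hand-made placeholders, and the function names, the Euclidean metric, and the threshold value are all hypothetical choices.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def near_duplicates(embeddings, threshold):
    """Return name pairs whose vectors lie within `threshold` of each other.

    `threshold` is an assumed tuning parameter, not a value from the paper.
    """
    return [
        (name_a, name_b)
        for (name_a, u), (name_b, v) in combinations(embeddings.items(), 2)
        if euclidean(u, v) <= threshold
    ]


def diversity(embeddings):
    """Mean pairwise distance; one simple (assumed) way to score diversity."""
    pairs = list(combinations(embeddings.values(), 2))
    if not pairs:
        return 0.0
    return sum(euclidean(u, v) for u, v in pairs) / len(pairs)


# Hypothetical, hand-made feature vectors; not data from the paper.
embeddings = {
    "bubble_a": [1.0, 0.0, 2.0],
    "bubble_b": [1.0, 0.1, 2.0],
    "merge": [0.2, 1.0, 0.5],
}

print(near_duplicates(embeddings, threshold=0.2))
print(diversity(embeddings))
```

A set of generated programs that all collapse onto nearly the same vector would score close to zero on this diversity measure, which is the intuition behind using an embedding to quantify diversity.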
Abstract
Given two algorithms for the same problem, can we determine whether they are meaningfully different? In full generality, the question is uncomputable, and empirically it is muddied by competing notions of similarity. Yet, in many applications (such as clone detection or program synthesis), a pragmatic and consistent similarity metric is necessary. We review existing equivalence and similarity notions and introduce EMOC: an Evaluation-Memory-Operations-Complexity framework that embeds algorithm implementations into a feature space suitable for downstream tasks. We compile PACD, a curated dataset of verified Python implementations across three problems, and show that EMOC features support clustering and classification of algorithm types, detection of near-duplicates, and quantification of diversity in LLM-generated programs. Code, data, and utilities for computing EMOC embeddings are…
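The abstract does not spell out the EMOC features, so the following is only a sketch of the general idea of embedding implementations into a feature space by measuring their behaviour. It is loosely inspired by the "Operations" and "Complexity" parts of the acronym and should not be read as the authors' method: the `Counted` wrapper, the `embed` function, the two features, and the choice of reversed input are all assumptions made for illustration.

```python
import math


class Counted:
    """Wraps a value and counts every `<` comparison made on it."""

    count = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.count += 1
        return self.value < other.value


def bubble_sort(items):
    items = list(items)
    for end in range(len(items) - 1, 0, -1):
        for i in range(end):
            if items[i + 1] < items[i]:
                items[i], items[i + 1] = items[i + 1], items[i]
    return items


def merge_sort(items):
    items = list(items)
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    return merged + left[i:] + right[j:]


def comparisons(sort_fn, n):
    """Count comparisons used to sort a reversed list of n items."""
    Counted.count = 0
    result = sort_fn([Counted(v) for v in range(n, 0, -1)])
    # Check correctness before trusting the count.
    assert [c.value for c in result] == list(range(1, n + 1))
    return Counted.count


def embed(sort_fn, n=64):
    """Hypothetical 2-feature embedding: log operation count and growth rate.

    The growth feature is log2 of the ratio of comparison counts when the
    input size doubles (about 2 for quadratic behaviour on this input).
    """
    small = comparisons(sort_fn, n)
    large = comparisons(sort_fn, 2 * n)
    return [math.log2(small), math.log2(large / small)]


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


print(embed(bubble_sort))
print(embed(merge_sort))
print(distance(embed(bubble_sort), embed(merge_sort)))
```

Two implementations of the same algorithm should land close together under such an embedding even if their source text differs, while algorithmically different solutions to the same problem should land further apart. A practical feature set would need more inputs and more measurements than this single reversed-list probe.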
Taxonomy
Topics: Software Engineering Research · Machine Learning and Data Classification · Computability, Logic, AI Algorithms
