Rethinking Code Similarity for Automated Algorithm Design with LLMs
Rui Zhang, Zhichao Lu

TL;DR
This paper introduces BehaveSim, a novel method for measuring algorithmic similarity based on problem-solving behavior trajectories, improving automated algorithm design and enabling systematic analysis of AI-generated algorithms.
Contribution
BehaveSim provides a new way to assess algorithmic similarity through behavior trajectories, enhancing LLM-AAD and facilitating analysis of generated algorithms.
Findings
BehaveSim improves diversity and performance in LLM-AAD tasks.
It effectively clusters algorithms by behavior, aiding analysis.
Open-source implementation available.
Abstract
The rise of Large Language Model-based Automated Algorithm Design (LLM-AAD) has transformed algorithm development by autonomously generating code implementations of expert-level algorithms. Unlike traditional expert-driven algorithm development, in the LLM-AAD paradigm, the main design principle behind an algorithm is often implicitly embedded in the generated code. Therefore, assessing algorithmic similarity directly from code, distinguishing genuine algorithmic innovation from mere syntactic variation, becomes essential. While various code similarity metrics exist, they fail to capture algorithmic similarity, as they focus on surface-level syntax or output equivalence rather than the underlying algorithmic logic. We propose BehaveSim, a novel method to measure algorithmic similarity through the lens of problem-solving behavior as a sequence of intermediate solutions produced during…
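To make the idea of a behavior trajectory concrete, the following is a minimal sketch, not the authors' implementation. It assumes a trajectory is the list state recorded after every swap of a sorting routine; that recording rule and all function names are hypothetical choices made for illustration. Two syntactically different bubble sorts yield the same trajectory, while a selection sort reaches the same final result through different intermediate solutions.

```python
from typing import List, Sequence, Tuple

# A trajectory is the sequence of intermediate solutions (assumed here:
# the list contents after each swap, plus the initial state).
Trajectory = List[Tuple[int, ...]]


def bubble_sort_for(xs: Sequence[int]) -> Trajectory:
    """Bubble sort written with for-loops; records the state after each swap."""
    a = list(xs)
    traj = [tuple(a)]
    n = len(a)
    for i in range(n):
        for j in range(n - 1 - i):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
                traj.append(tuple(a))
    return traj


def bubble_sort_while(xs: Sequence[int]) -> Trajectory:
    """The same algorithm with different syntax: while-loops and a temp variable."""
    a = list(xs)
    traj = [tuple(a)]
    end = len(a) - 1
    while end > 0:
        k = 0
        while k < end:
            if a[k + 1] < a[k]:
                tmp = a[k]
                a[k] = a[k + 1]
                a[k + 1] = tmp
                traj.append(tuple(a))
            k += 1
        end -= 1
    return traj


def selection_sort(xs: Sequence[int]) -> Trajectory:
    """A different algorithm that reaches the same sorted result."""
    a = list(xs)
    traj = [tuple(a)]
    for i in range(len(a) - 1):
        m = min(range(i, len(a)), key=a.__getitem__)
        if m != i:
            a[i], a[m] = a[m], a[i]
            traj.append(tuple(a))
    return traj


data = [2, 3, 1]
t_for = bubble_sort_for(data)
t_while = bubble_sort_while(data)
t_sel = selection_sort(data)

print(t_for == t_while)        # same behavior despite different code
print(t_for[-1] == t_sel[-1])  # same final result
print(t_for == t_sel)          # but different intermediate solutions
```

On this input the two bubble sorts are indistinguishable by trajectory, whereas the selection sort matches them only in its output, which is the distinction between behavior-level and result-level similarity that the abstract draws.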
Peer Reviews
Decision: ICLR 2026 Poster
[1] The paper addresses an important and overlooked gap by redefining algorithm similarity from the behavioral viewpoint. This is a clear and original contribution. [2] The distinction among code-level, behavior-level, and result-level similarity is well presented, and the taxonomy of four types of algorithm pairs is intuitive and pedagogically useful. [3] The idea of representing algorithm behavior through trajectories and computing DTW-based similarity is theoretically grounded and applicabl
[1] BehaveSim is currently designed for iterative algorithms only. It cannot yet handle recursive, dynamic programming, or machine-learning-based algorithms. This limits its generality. [2] Several heuristic parameters, such as trajectory truncation, normalization constants, and distance scaling, are not systematically analyzed. Their influence on stability and reproducibility is unclear. [3] The benchmark for similarity evaluation mainly includes synthetic or classical algorithm examples. Bro
The curated dataset with 4 algorithm pair types (varying code/behavior/result similarity combinations) provides rigorous validation. The results clearly demonstrate that BehaveSim achieves 1.0 similarity on Type-3 pairs (same behavior, different code) while existing code metrics fail, and correctly identifies behavioral differences where code metrics show high similarity.
1. The evaluation methodology does not use any AI models or AI-related methods. BehaveSim is essentially a general algorithm comparison technique based on execution traces and DTW, which appears equally applicable to comparing human-written code. The source of code (LLM-generated versus human-written) seems irrelevant to the core methodology, raising questions about whether this is fundamentally a software engineering contribution rather than an AI/ML contribution suitable for ICLR. 2. The metho
Novel Perspective: The paper identifies a clear conceptual gap between code similarity and algorithmic behavior similarity, proposing an elegant behavioral abstraction based on execution trajectories. This reframing is insightful and well-motivated in the context of LLM-generated algorithms. Concrete Implementation (BehaveSim): The definition of behavioral trajectories and the use of DTW distance provide a simple yet powerful operationalization of behavioral similarity. The methodology is well
Scope Limitation: BehaveSim applies only to iterative algorithms producing discrete trajectories. Many LLM-generated algorithms, including stochastic, differentiable, or recursive paradigms, are excluded. This significantly restricts generality. Metric Design Choices: The use of DTW on normalized edit or Euclidean distances is heuristic; there’s limited justification for why DTW best captures “behavioral similarity.” Ablation on alternative measures (ERP, cosine, etc.) is included but not theor
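The reviews above describe BehaveSim as computing a DTW-based similarity over behavior trajectories. The sketch below shows textbook dynamic time warping over two trajectories of numeric solutions, assuming a Euclidean point distance. The mapping `1 / (1 + d)` from distance to similarity is an illustrative assumption, not the paper's normalization, and the function names are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Solution = Tuple[float, ...]


def euclidean(p: Solution, q: Solution) -> float:
    """Euclidean distance between two intermediate solutions of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def dtw_distance(
    s: Sequence[Solution],
    t: Sequence[Solution],
    dist: Callable[[Solution, Solution], float] = euclidean,
) -> float:
    """Classic O(len(s) * len(t)) dynamic time warping distance."""
    n, m = len(s), len(t)
    inf = float("inf")
    # cost[i][j] = best alignment cost of s[:i] with t[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(s[i - 1], t[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # s advances, t repeats
                cost[i][j - 1],      # t advances, s repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def behavior_similarity(s: Sequence[Solution], t: Sequence[Solution]) -> float:
    """Map DTW distance into (0, 1]; this 1/(1+d) scaling is an assumption."""
    return 1.0 / (1.0 + dtw_distance(s, t))


# Two hypothetical trajectories that end in the same solution.
traj_a = [(2, 3, 1), (2, 1, 3), (1, 2, 3)]
traj_b = [(2, 3, 1), (1, 3, 2), (1, 2, 3)]
# A trajectory that lingers on one state; DTW absorbs the repeat.
traj_slow = [(2, 3, 1), (2, 3, 1), (2, 1, 3), (1, 2, 3)]

print(behavior_similarity(traj_a, traj_a))
print(behavior_similarity(traj_a, traj_slow))
print(behavior_similarity(traj_a, traj_b))
```

Because DTW lets one sequence repeat a step while the other advances, trajectories that visit the same states at different speeds receive zero distance. The reviewers' question about whether DTW is the best-justified choice concerns exactly this kind of alignment.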
Taxonomy
Topics: Machine Learning and Data Classification · Time Series Analysis and Forecasting · Topic Modeling
