Identifier-Free Code Embedding Models for Scalable Search

Eric Wolos; Michael Doyle

arXiv:2605.05251 · cs.CR · May 8, 2026

TL;DR

This paper introduces an embedding model for bidirectional function association between source code and decompiled, stripped code, aimed at more scalable and accurate function search in reverse engineering.

Contribution

It formalizes the function association problem and demonstrates that fine-tuning a Qwen3-Embedding model with contrastive learning significantly enhances performance.
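To make the contrastive objective concrete, here is a minimal sketch of a symmetric InfoNCE loss over paired source and decompiled embeddings. This summary does not state the exact loss, temperature, or batching the authors used, so the symmetric form, the `temperature` value, and all function names below are assumptions for illustration only.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(queries, keys, temperature):
    """Mean cross-entropy where queries[i] should match keys[i];
    every other key in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)

def symmetric_info_nce(src_emb, dec_emb, temperature=0.05):
    """Average the source->decompiled and decompiled->source losses.
    The temperature of 0.05 is an assumed value, not one from the paper."""
    src = [normalize(v) for v in src_emb]
    dec = [normalize(v) for v in dec_emb]
    return 0.5 * (info_nce(src, dec, temperature) + info_nce(dec, src, temperature))
```

Averaging the two directions is one plausible way to train for bidirectional association, since each representation serves as the query in one term and as the candidate set in the other.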

Findings

01 The fine-tuned model outperforms the other evaluated models on all function association baselines.

02 The model generalizes to a constant-algorithm association task without explicit training on it.

03 Fine-tuning with contrastive learning improves bidirectional association accuracy.

Abstract

Function association is a useful process for binary reverse engineers. Search tools exist to perform association at scale, but they do not utilize the full range of capabilities that AI-enabled search provides. Prior work has explored the development of embedding models for association between certain reverse engineering code representations, but that work does not cover bidirectional association between source code and decompiled, stripped code with standard preprocessing requirements. To bridge this gap, we formalize this function association problem and evaluate the extent to which embedding models can bidirectionally associate between these two representations. To improve model performance at this task, we fine-tune a Qwen3-Embedding model with contrastive learning. We find that our new model outperforms other models on all function association baselines by a substantial margin and…
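As a rough illustration of what bidirectional association looks like at query time, the sketch below ranks candidates by cosine similarity in both directions. The vectors are hand-written stand-ins for model outputs, and the function names (`src:parse_header`, `dec:FUN_00401000`, and so on) are hypothetical. A real deployment would obtain embeddings from the fine-tuned model and would probably use an approximate nearest-neighbor index rather than a linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, index, k=1):
    """Return the k (name, score) pairs in `index` most similar to `query`."""
    scored = [(name, cosine(query, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy embeddings (assumed values, not produced by any real model).
source_index = {
    "src:parse_header": [0.9, 0.1, 0.0],
    "src:crc32": [0.0, 0.2, 0.9],
}
decompiled_index = {
    "dec:FUN_00401000": [0.8, 0.2, 0.1],
    "dec:FUN_00402000": [0.1, 0.1, 0.95],
}

# Decompiled -> source: which source function does this stripped function match?
match_src = top_k(decompiled_index["dec:FUN_00401000"], source_index)[0][0]

# Source -> decompiled: where does this source function appear in the binary?
match_dec = top_k(source_index["src:crc32"], decompiled_index)[0][0]
```

Because both representations share one embedding space, the same `top_k` routine serves either direction. Only the roles of query and index swap.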
