Identifier-Free Code Embedding Models for Scalable Search
Eric Wolos, Michael Doyle

TL;DR
This paper introduces a new AI-based embedding model for bidirectional function association between source code and decompiled code, improving scalability and accuracy in reverse engineering tasks.
Contribution
It formalizes the function association problem and demonstrates that fine-tuning a Qwen3-Embedding model with contrastive learning significantly enhances performance.
Findings
The fine-tuned model outperforms existing baselines on all function association tasks.
The model generalizes to a constant-algorithm association task without explicit training on that task.
Fine-tuning with contrastive learning improves bidirectional association accuracy.
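To make the contrastive fine-tuning idea concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over a batch of paired embeddings. The summary does not state which contrastive objective the authors use, so the loss form, the temperature value, and the names `symmetric_info_nce`, `src_emb`, and `dec_emb` are all assumptions chosen for illustration. Row i of each matrix is assumed to embed the same function, once from source code and once from decompiled code.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(src_emb, dec_emb, temperature=0.05):
    """Illustrative symmetric InfoNCE loss (an assumption, not the paper's stated loss).

    src_emb[i] and dec_emb[i] are treated as a positive pair; every other
    row in the batch acts as an in-batch negative.
    """
    src = l2_normalize(np.asarray(src_emb, dtype=np.float64))
    dec = l2_normalize(np.asarray(dec_emb, dtype=np.float64))
    logits = src @ dec.T / temperature  # shape: (batch, batch)
    idx = np.arange(logits.shape[0])
    # Source -> decompiled: each row should pick its own column.
    loss_s2d = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Decompiled -> source: each column should pick its own row.
    loss_d2s = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_s2d + loss_d2s)
```

Averaging the two directions is one simple way to encourage an embedding space that works for both source-to-decompiled and decompiled-to-source queries; other weightings are possible.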
Abstract
Function association is a useful process for binary reverse engineers. Search tools exist to perform association at scale, but they do not utilize the full range of capabilities that AI-enabled search provides. Prior work has explored the development of embedding models for association between certain reverse engineering code representations, but that work does not cover bidirectional association between source code and decompiled, stripped code with standard preprocessing requirements. To bridge this gap, we formalize this function association problem and evaluate the extent to which embedding models can bidirectionally associate between these two representations. To improve model performance at this task, we fine-tune a Qwen3-Embedding model with contrastive learning. We find that our new model outperforms other models on all function association baselines by a substantial margin and…
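The bidirectional association described in the abstract can be sketched as nearest-neighbor search in both directions over a shared embedding space. The sketch below is illustrative only: it assumes embeddings have already been computed by some model, that row i of each matrix corresponds to the same function, and that top-1 accuracy under cosine similarity is the metric of interest. The function names `top1_accuracy` and `bidirectional_top1` are hypothetical, and the summary does not specify the paper's evaluation metrics.

```python
import numpy as np


def _unit_rows(x, eps=1e-12):
    """Return a float copy of x with each row scaled to unit length."""
    x = np.asarray(x, dtype=np.float64)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def top1_accuracy(query_emb, corpus_emb):
    """Fraction of queries whose most similar corpus row has the same index."""
    sims = _unit_rows(query_emb) @ _unit_rows(corpus_emb).T
    predicted = sims.argmax(axis=1)
    return float((predicted == np.arange(sims.shape[0])).mean())


def bidirectional_top1(src_emb, dec_emb):
    """Evaluate association in both directions (hypothetical helper)."""
    return {
        "source_to_decompiled": top1_accuracy(src_emb, dec_emb),
        "decompiled_to_source": top1_accuracy(dec_emb, src_emb),
    }
```

Reporting the two directions separately makes it visible when a model retrieves well one way but poorly the other, which a single averaged score would hide.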
