Cross-modal Retrieval Models for Stripped Binary Analysis
Guoqiang Chen, Lingyun Ying, Ziyang Song, Daguang Liu, Qiang Wang, Zhiqi Wang, Li Hu, Shaoyin Cheng, Weiming Zhang, Nenghai Yu

TL;DR
This paper introduces BinSeek, a two-stage cross-modal retrieval framework that significantly improves the accuracy of retrieving relevant stripped binary code from natural language queries, aiding software security tasks.
Contribution
BinSeek is a novel two-stage retrieval system with a large-scale trained embedding model and a context-aware reranker, plus a new benchmark for binary code and language retrieval.
Findings
Achieved a 31.42% improvement in Rec@3 over baseline models.
Surpassed larger models by 27.17% in MRR@3.
Established a domain-specific benchmark for future research.
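
For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how Rec@k and MRR@k are commonly computed. It assumes each query has exactly one relevant function, which this summary does not confirm for the paper's benchmark; the function names and the toy data are hypothetical and are not taken from the paper.

```python
def recall_at_k(ranked_ids, relevant_id, k=3):
    """Return 1.0 if the relevant item appears in the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def mrr_at_k(ranked_ids, relevant_id, k=3):
    """Return the reciprocal rank of the relevant item within the top k, else 0.0."""
    for rank, candidate in enumerate(ranked_ids[:k], start=1):
        if candidate == relevant_id:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=3):
    """Average both metrics over (ranked_ids, relevant_id) pairs.

    Assumption: one relevant function per query (hypothetical setup).
    """
    n = len(runs)
    rec = sum(recall_at_k(ranked, gold, k) for ranked, gold in runs) / n
    mrr = sum(mrr_at_k(ranked, gold, k) for ranked, gold in runs) / n
    return rec, mrr


# Toy example: the first query hits at rank 2, the second misses the top 3.
runs = [
    (["f1", "f2", "f3"], "f2"),
    (["f4", "f5", "f6", "f7"], "f7"),
]
rec3, mrr3 = evaluate(runs, k=3)
```

Rec@3 only asks whether the right function made the top three, while MRR@3 also rewards placing it higher, so the two numbers can move independently.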
Abstract
Retrieving binary code via natural language queries is a pivotal capability for downstream tasks in the software security domain, such as vulnerability detection and malware analysis. However, it is challenging to identify binary functions semantically relevant to the user query from thousands of candidates, as the absence of symbolic information distinguishes this task from source code retrieval. In this paper, we introduce BinSeek, a two-stage cross-modal retrieval framework for stripped binary code analysis. It consists of two models: BinSeek-Embedding is trained on a large-scale dataset to learn the semantic relevance between binary code and natural language descriptions, while BinSeek-Reranker learns to carefully judge the relevance of each candidate code to the description with context augmentation. To this end, we built an LLM-based data synthesis pipeline to automate…
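
The two-stage flow described in the abstract (a fast embedding-based recall step followed by a context-aware rerank) can be sketched roughly as follows. This is an illustration of the general retrieve-then-rerank pattern only, not BinSeek's implementation: `toy_embed` and `toy_rerank_score` are hypothetical bag-of-words stand-ins for the trained models, and the corpus fields (`id`, `code`, `context`) are assumed for the example.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_rerank_score(query, code, context):
    """Hypothetical stand-in for a reranker: scores the candidate together
    with extra context (here, a short note about its caller)."""
    return cosine(toy_embed(query), toy_embed(code + " " + context))


def two_stage_search(query, corpus, recall_k=10, final_k=3):
    q = toy_embed(query)
    # Stage 1: cheap similarity over the whole corpus, keep the top recall_k.
    stage1 = sorted(
        corpus, key=lambda f: cosine(q, toy_embed(f["code"])), reverse=True
    )[:recall_k]
    # Stage 2: rescore only the shortlisted candidates, now with context.
    stage2 = sorted(
        stage1,
        key=lambda f: toy_rerank_score(query, f["code"], f["context"]),
        reverse=True,
    )
    return [f["id"] for f in stage2[:final_k]]


# Toy corpus of stripped functions (names and contents are made up).
corpus = [
    {"id": "sub_401000", "code": "call recv call memcpy",
     "context": "caller reads network packet buffer"},
    {"id": "sub_401200", "code": "call fopen call fread",
     "context": "caller loads config file"},
    {"id": "sub_401400", "code": "xor loop call send",
     "context": "caller encrypts packet before send"},
]
result = two_stage_search("recv network packet buffer", corpus)
```

The design point is cost: the first stage must be cheap enough to run against thousands of candidates, while the second stage can afford a more careful, context-augmented judgment because it only sees the shortlist.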
Taxonomy
Topics: Advanced Malware Detection Techniques · Software Engineering Research · Software Testing and Debugging Techniques
