EXHIB: A Benchmark for Realistic and Diverse Evaluation of Function Similarity in the Wild

Yiming Fan (1); Jun Yeon Won (1); Ding Zhu (1); Melih Sirlanci (1); Mahdi Khalili (1); and Carter Yagemann (1) ((1) The Ohio State University)

arXiv:2604.01554·cs.CR·April 3, 2026

TL;DR

EXHIB is a comprehensive benchmark with five real-world datasets for evaluating binary function similarity detection models, revealing significant generalization gaps and robustness issues.

Contribution

EXHIB is a diverse, realistic benchmark for binary function similarity detection (BFSD) that allows models to be compared across different application scenarios.

Findings

01. Models show up to 30% performance degradation on firmware and semantic datasets.

02. Current models lack robustness to high-level semantic differences.

03. Existing evaluation practices overlook critical generalization challenges.
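
To make the idea of a generalization gap concrete, the sketch below scores a similarity model as a retrieval task and then measures how much the score drops between two datasets. It is an illustration only: this page does not say which metric EXHIB reports or whether the 30% figure is relative or absolute, so the use of Recall@1, the cosine-similarity ranking, the embedding format, and the function names `recall_at_1` and `relative_drop` are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recall_at_1(queries, pool):
    """Fraction of queries whose most similar pool entry has the same label.

    queries, pool: lists of (label, embedding) pairs. The label stands in
    for the identity of the source function (hypothetical format).
    """
    hits = 0
    for q_label, q_vec in queries:
        best_label, _ = max(pool, key=lambda item: cosine(q_vec, item[1]))
        hits += best_label == q_label
    return hits / len(queries)


def relative_drop(baseline, shifted):
    """Relative degradation of a score when moving from one dataset to another."""
    return (baseline - shifted) / baseline


# Toy embeddings, invented for illustration (not taken from the paper).
pool = [("f", [1.0, 0.0]), ("g", [0.0, 1.0])]
easy_queries = [("f", [0.9, 0.1]), ("g", [0.2, 0.8])]
hard_queries = [("f", [0.1, 0.9]), ("g", [0.2, 0.8])]

easy_score = recall_at_1(easy_queries, pool)
hard_score = recall_at_1(hard_queries, pool)
print(easy_score, hard_score, relative_drop(easy_score, hard_score))
```

In this toy setup the "hard" queries stand in for a dataset whose functions have drifted away from their reference versions, which is the kind of shift that would surface as a lower score on a more realistic dataset.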

Abstract

Binary Function Similarity Detection (BFSD) is a core problem in software security, supporting tasks such as vulnerability analysis, malware classification, and patch provenance. In the past few decades, numerous models and tools have been developed for this application; however, due to the lack of a comprehensive universal benchmark in this field, researchers have struggled to compare different models effectively. Existing datasets are limited in scope, often focusing on a narrow set of transformations or types of binaries, and fail to reflect the full diversity of real-world applications. We introduce EXHIB, a benchmark comprising five realistic datasets collected from the wild, each highlighting a distinct aspect of the BFSD problem space. We evaluate 9 representative models spanning multiple BFSD paradigms on EXHIB and observe performance degradations of up to 30% on firmware and…
