Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

Tao Yu; yiming ding; Shenghua Chai; Minghui Zhang; Zhongtian Luo; Xinming Wang; Xinlong Chen; Zhaolu Kang; Junhao Gong; Yuxuan Zhou; Haopeng Jin; Zhiqing Cui; Jiabing Yang; YiFan Zhang; Hongzhu Yi; Zheqi He; Xi Yang; Yan Huang; Liang Wang

arXiv:2605.08762 · cs.SD · May 12, 2026

TL;DR

This paper introduces Omni-DeepSearch, a benchmark for audio-driven omni-modal deep search: starting from audio alone, a model must call text, image, and video search tools and reason over multiple hops to answer a question.

Contribution

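The paper presents a benchmark of 640 samples across 15 fine-grained categories. A multi-stage filtering pipeline keeps only samples that depend on the audio, require retrieval, require visual evidence, and have a unique answer, so the benchmark tests audio-based cross-modal search and reasoning rather than recall.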

Findings

01. The strongest model achieves only 43.44% accuracy, indicating high task difficulty.

02. Key bottlenecks include audio entity inference, query formulation, and multi-hop retrieval.

03. The benchmark reveals significant challenges in audio-driven omni-modal reasoning.
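
The accuracy figure above is a percentage of questions answered correctly. This summary does not describe the paper's scoring protocol, so the following is only a minimal sketch of how short, verifiable answers could be scored, assuming normalized exact match. The function names `normalize` and `accuracy` and the toy answers are hypothetical and are not taken from the benchmark.

```python
import re
import string


def normalize(answer: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = answer.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Percentage of predictions equal to their reference after normalization.

    Assumption: exact match is the criterion. The paper may use a different
    judge (for example, a model-based grader).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Toy data for illustration only; not drawn from Omni-DeepSearch.
preds = ["Vienna", "Blue Whale!", "1969", "Mozart"]
refs = ["vienna", "blue whale", "1970", "Beethoven"]
score = accuracy(preds, refs)
print(f"{score:.2f}%")
```

Exact match is strict: it is cheap and reproducible, but it penalizes answers that are correct yet phrased differently, which is one reason benchmarks of this kind ask for short, objective answers.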

Abstract

Current omni-modal benchmarks mainly evaluate models under settings where multiple modalities are provided simultaneously, while the ability to start from audio alone and actively search for cross-modal evidence remains underexplored. In this paper, we introduce Omni-DeepSearch, a benchmark for audio-driven omni-modal deep search. Given one or more audio clips and a related question, models must infer useful clues from audio, invoke text, image, and video search tools, and perform multi-hop reasoning to produce a short, objective, and verifiable answer. Omni-DeepSearch contains 640 samples across 15 fine-grained categories, covering four retrieval target modalities and four audio content types. A multi-stage filtering pipeline ensures audio dependence, retrieval necessity, visual modality necessity, and answer uniqueness. Experiments on recent closed-source and open-source…
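
The task flow described in the abstract (infer a clue from audio, call search tools, reason over several hops, return a short answer) can be illustrated with a small loop. This is a sketch under stated assumptions, not the authors' evaluation harness: `Sample`, `Trace`, `run_deep_search`, the tool names, and the stubbed model functions are all hypothetical, and the lookup tables stand in for real audio understanding and real search back ends.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Sample:
    """Hypothetical record: audio clips, a question, and a short reference answer."""
    audio_paths: List[str]
    question: str
    answer: str


@dataclass
class Trace:
    """Queries issued and evidence collected during one search episode."""
    queries: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)


SearchTool = Callable[[str], List[str]]          # query -> evidence snippets
Step = Optional[Tuple[str, str]]                 # (tool name, query), or None to stop


def run_deep_search(
    sample: Sample,
    infer_clue: Callable[[List[str]], str],
    plan: Callable[[str, str, List[str]], Step],
    answer: Callable[[str, List[str]], str],
    tools: Dict[str, SearchTool],
    max_hops: int = 3,
) -> Tuple[str, Trace]:
    """Audio clue -> repeated tool calls (hops) -> final short answer."""
    clue = infer_clue(sample.audio_paths)
    trace = Trace()
    for _ in range(max_hops):
        step = plan(sample.question, clue, trace.evidence)
        if step is None:                         # the planner decides it has enough evidence
            break
        tool_name, query = step
        trace.queries.append(f"{tool_name}:{query}")
        trace.evidence.extend(tools[tool_name](query))
    return answer(sample.question, trace.evidence), trace


# --- Toy stubs below; a real system would call an omni-modal model and search APIs. ---
TEXT_INDEX = {"ode to joy": ["Ode to Joy is the finale of Beethoven's Symphony No. 9"]}
IMAGE_INDEX = {"beethoven symphony 9 premiere venue": ["Photo caption: premiere venue in Vienna"]}
tools: Dict[str, SearchTool] = {
    "text": lambda q: TEXT_INDEX.get(q, []),
    "image": lambda q: IMAGE_INDEX.get(q, []),
    "video": lambda q: [],
}


def infer_clue(audio_paths: List[str]) -> str:
    return "ode to joy"                          # stand-in for audio entity inference


def plan(question: str, clue: str, evidence: List[str]) -> Step:
    if not evidence:
        return ("text", clue)                    # hop 1: identify the work
    if len(evidence) == 1:
        return ("image", "beethoven symphony 9 premiere venue")  # hop 2: visual evidence
    return None


def answer(question: str, evidence: List[str]) -> str:
    return "Vienna" if any("Vienna" in e for e in evidence) else "unknown"


sample = Sample(["clip.wav"], "In which city did the piece in the clip premiere?", "Vienna")
prediction, trace = run_deep_search(sample, infer_clue, plan, answer, tools)
print(prediction, trace.queries)
```

Each of the three bottlenecks named in the findings corresponds to one pluggable function here: `infer_clue` (audio entity inference), `plan` (query formulation), and the hop loop itself (multi-hop retrieval), so a failure in any one of them propagates to the final answer.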
