Active Perception Agent for Omnimodal Audio-Video Understanding

Keda Tao; Wenjie Du; Bohan Yu; Weiqiang Wang; Jian Liu; Huan Wang

arXiv:2512.23646 · cs.CV · February 6, 2026

Open Access

TL;DR

OmniAgent is a novel active perception system that dynamically orchestrates unimodal tools for fine-grained audio-visual understanding, achieving state-of-the-art results without additional training.

Contribution

The paper introduces what the authors describe as, to their knowledge, the first fully active perception agent for omnimodal reasoning, shifting from passive response generation to active multimodal inquiry with a novel coarse-to-fine audio-guided perception paradigm.

Findings

01 Achieves 10-20% higher accuracy than existing models on audio-video understanding benchmarks.

02 Outperforms existing models without additional training.

03 Demonstrates effective dynamic tool orchestration.

Abstract

Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they often face challenges in fine-grained cross-modal understanding and have difficulty with multimodal alignment. To address these limitations, we introduce OmniAgent, to the best of our knowledge the first fully active perception agent that dynamically orchestrates specialized unimodal tools to achieve more fine-grained omnimodal reasoning. Unlike previous works that rely on rigid, static workflows and dense frame-captioning, we demonstrate a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages…
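
The on-demand tool orchestration described in the abstract can be pictured with a toy plan-and-act loop. The sketch below is illustrative only and is not the authors' implementation: the tool functions, the `plan_next` rule, the time windows, and every name in it are hypothetical, and the rule-based planner merely stands in for whatever model drives planning in OmniAgent. The "listen to a wide window first, then look at a narrower one" ordering is one assumed reading of "coarse-to-fine audio-guided"; the excerpt above is truncated before the details.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


# Hypothetical stub tools. A real system would call unimodal models here.
def audio_tool(query: str, span: Span) -> str:
    return f"audio[{span[0]:.0f}-{span[1]:.0f}s]: cues for '{query}'"


def video_tool(query: str, span: Span) -> str:
    return f"video[{span[0]:.0f}-{span[1]:.0f}s]: frames for '{query}'"


TOOLS: Dict[str, Callable[[str, Span], str]] = {
    "audio": audio_tool,
    "video": video_tool,
}


@dataclass
class AgentState:
    question: str
    observations: List[str] = field(default_factory=list)


def plan_next(state: AgentState) -> Optional[Tuple[str, Span]]:
    """Toy planner (assumption): a fixed rule standing in for a learned planner.

    Step 1 listens to a coarse window, step 2 inspects a narrower visual
    window, and then the planner stops by returning None.
    """
    if not state.observations:
        return ("audio", (0.0, 60.0))
    if len(state.observations) == 1:
        return ("video", (20.0, 30.0))
    return None


def run_agent(question: str, max_steps: int = 5) -> List[str]:
    """Invoke tools only when the planner asks for them, up to max_steps."""
    state = AgentState(question)
    for _ in range(max_steps):
        action = plan_next(state)
        if action is None:
            break
        tool_name, span = action
        state.observations.append(TOOLS[tool_name](question, span))
    return state.observations
```

Calling `run_agent("who is speaking?")` in this sketch yields two observations, one per tool call, which contrasts with a static pipeline that would caption every frame regardless of the question.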

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing