Active Perception Agent for Omnimodal Audio-Video Understanding
Keda Tao, Wenjie Du, Bohan Yu, Weiqiang Wang, Jian Liu, Huan Wang

TL;DR
OmniAgent is a novel active perception system that dynamically orchestrates unimodal tools for fine-grained audio-visual understanding, achieving state-of-the-art results without additional training.
Contribution
It introduces the first fully active perception agent for omnimodal reasoning, shifting from passive to active multimodal inquiry with a novel audio-guided perception paradigm.
Findings
Achieves 10-20% higher accuracy than existing models on audio-video understanding benchmarks.
Outperforms existing models without training.
Demonstrates effective dynamic tool orchestration.
Abstract
Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they often face challenges in fine-grained cross-modal understanding and have difficulty with multimodal alignment. To address these limitations, we introduce OmniAgent, to the best of our knowledge the first fully active perception agent that dynamically orchestrates specialized unimodal tools to achieve more fine-grained omnimodal reasoning. Unlike previous works that rely on rigid, static workflows and dense frame-captioning, we demonstrate a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages…
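The on-demand tool orchestration described in the abstract can be illustrated with a toy loop. This is only a sketch of one possible reading, not the authors' implementation: the names `Observation`, `AgentState`, `coarse_audio_tool`, `fine_video_tool`, `plan_next`, and `run_agent` are all hypothetical, the tools are stubs that return canned results, and the planner is a fixed rule standing in for whatever model-driven planning the paper actually uses. The assumption illustrated is that an audio pass first narrows the time window, and a visual pass then inspects only that window.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Window = Tuple[float, float]  # (start, end) in seconds


@dataclass
class Observation:
    text: str
    focus: Optional[Window] = None  # narrower window suggested by the tool, if any


@dataclass
class AgentState:
    question: str
    window: Window
    observations: List[Observation] = field(default_factory=list)


# Hypothetical tool signature: inspect a time window, return an observation.
Tool = Callable[[float, float], Observation]


def coarse_audio_tool(start: float, end: float) -> Observation:
    # Stub: pretends an audio model localised a relevant cue at 12-15 s.
    focus = (max(start, 12.0), min(end, 15.0))
    return Observation("audio: relevant cue localised", focus=focus)


def fine_video_tool(start: float, end: float) -> Observation:
    # Stub: pretends a visual model looked only at the given window.
    return Observation(f"video: inspected frames in [{start:.1f}s, {end:.1f}s]")


def plan_next(state: AgentState) -> str:
    # Toy planner (assumption): listen first, then look, then answer.
    if not state.observations:
        return "audio"
    if len(state.observations) == 1:
        return "video"
    return "answer"


def run_agent(question: str, duration: float,
              tools: Dict[str, Tool], max_steps: int = 4) -> AgentState:
    state = AgentState(question=question, window=(0.0, duration))
    for _ in range(max_steps):
        action = plan_next(state)
        if action == "answer":
            break
        obs = tools[action](*state.window)
        state.observations.append(obs)
        if obs.focus is not None:
            state.window = obs.focus  # coarse-to-fine: shrink the window
    return state


TOOLS: Dict[str, Tool] = {"audio": coarse_audio_tool, "video": fine_video_tool}
state = run_agent("Who speaks after the door slams?", 60.0, TOOLS)
for obs in state.observations:
    print(obs.text)
```

In this sketch the visual tool is never run over the full 60-second clip; it is called only on the window the audio stub returned, which is the contrast with dense frame-captioning that the abstract draws.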
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
