CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering
Yuanyuan Jiang, Jianqin Yin

TL;DR
This paper introduces a novel CLIP-powered single-stream network for audio-visual question answering that effectively unifies audio and visual modalities and leverages image-text matching knowledge for improved reasoning.
Contribution
It proposes a target-aware spatial grounding module and a unified temporal grounding module that utilize CLIP's knowledge for better audio-visual reasoning in AVQA.
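The idea of reusing image-text matching knowledge for spatial grounding can be illustrated with a minimal sketch. This is not the authors' TSG+ module: it only shows the generic technique of weighting image patch embeddings by their cosine similarity to a text embedding. The function name `text_guided_patch_pooling`, the temperature value, the 49 x 512 feature shape, and the random stand-in features (used in place of real CLIP outputs) are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def text_guided_patch_pooling(patch_feats, text_feat, temperature=0.07):
    """Weight patch embeddings by similarity to a text embedding (hypothetical helper).

    patch_feats: (num_patches, dim) array of patch embeddings.
    text_feat:   (dim,) array, e.g. an embedding of the question's target word.
    Returns softmax weights over patches and the weighted patch feature.
    """
    patches = l2_normalize(patch_feats)
    text = l2_normalize(text_feat)
    sims = patches @ text                 # cosine similarity per patch
    logits = sims / temperature           # assumed temperature, not from the paper
    logits = logits - logits.max()        # subtract max for numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()
    pooled = weights @ patch_feats        # similarity-weighted sum of patches
    return weights, pooled

# Stand-in data: random vectors instead of real CLIP features.
rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(49, 512))                  # e.g. a 7x7 patch grid
text_feat = patch_feats[10] + 0.1 * rng.normal(size=512)  # text close to patch 10
weights, pooled = text_guided_patch_pooling(patch_feats, text_feat)
```

In this toy setup the patch whose embedding is closest to the text embedding receives the largest weight, which is the behavior a target-aware grounding step relies on.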
Findings
Outperforms existing state-of-the-art methods on the MUSIC-AVQA benchmark.
Effectively transfers image-text matching knowledge to audio-visual tasks.
Unifies audio and visual modalities in a single-stream architecture.
Abstract
While vision-language pretrained models (VLMs) excel in various multimodal understanding tasks, their potential in fine-grained audio-visual reasoning, particularly for audio-visual question answering (AVQA), remains largely unexplored. AVQA presents specific challenges for VLMs because it requires region-level visual understanding and seamless integration with the audio modality. Previous VLM-based AVQA methods merely used CLIP as a feature encoder and underutilized its knowledge, and, like most AVQA methods, treated audio and video as separate entities in a dual-stream framework. This paper proposes a new CLIP-powered target-aware single-stream (TASS) network for AVQA that exploits the image-text matching knowledge of the pretrained model through the natural correspondence between audio and visual content. It consists of two key components: the target-aware spatial grounding module (TSG+) and…
Taxonomy
Topics: Music and Audio Processing · Speech and dialogue systems · Speech Recognition and Synthesis
Methods: Contrastive Language-Image Pre-training
