CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering

Yuanyuan Jiang; Jianqin Yin

arXiv:2405.07451 · cs.CV · May 14, 2024

TL;DR

This paper introduces a novel CLIP-powered single-stream network for audio-visual question answering that effectively unifies audio and visual modalities and leverages image-text matching knowledge for improved reasoning.

Contribution

It proposes a target-aware spatial grounding module and a unified temporal grounding module that utilize CLIP's knowledge for better audio-visual reasoning in AVQA.
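The general idea of reusing image-text matching for spatial grounding can be illustrated with a minimal sketch. This is not the paper's TSG+ module: it only shows how a text embedding for a target could score image patch embeddings by cosine similarity and pool them accordingly. The function name `ground_target`, the `temperature` default, the feature shapes, and the random arrays standing in for CLIP features are all assumptions made for illustration.

```python
import numpy as np


def ground_target(patch_feats, text_feat, temperature=0.07):
    """Weight image patches by their similarity to a target text embedding.

    Hypothetical helper, not the authors' module.
    patch_feats: (num_patches, dim) array of patch embeddings.
    text_feat:   (dim,) array, embedding of the target phrase.
    Returns (weights, pooled): softmax weights over patches and the
    weighted sum of the patch embeddings.
    """
    # L2-normalise so the dot product below is a cosine similarity.
    patches = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    text = text_feat / np.linalg.norm(text_feat)

    sims = patches @ text                  # (num_patches,)
    logits = sims / temperature
    logits = logits - logits.max()         # stabilise the softmax
    weights = np.exp(logits)
    weights = weights / weights.sum()

    pooled = weights @ patch_feats         # (dim,)
    return weights, pooled


# Stand-in features (assumption): random vectors instead of real CLIP outputs.
rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(49, 512))
# Make the text embedding close to patch 10 so that patch should win.
text_feat = patch_feats[10] + 0.1 * rng.normal(size=512)

weights, pooled = ground_target(patch_feats, text_feat)
print(int(weights.argmax()), pooled.shape)
```

With real features, `patch_feats` would come from an image encoder and `text_feat` from a text encoder that share an embedding space; how the paper's modules combine these with audio is not shown here.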

Findings

01. Outperforms existing state-of-the-art methods on the MUSIC-AVQA benchmark.

02. Effectively transfers image-text matching knowledge to audio-visual tasks.

03. Unifies audio and visual modalities in a single-stream architecture.

Abstract

While vision-language pretrained models (VLMs) excel in various multimodal understanding tasks, their potential in fine-grained audio-visual reasoning, particularly for audio-visual question answering (AVQA), remains largely unexplored. AVQA poses specific challenges for VLMs because it requires region-level visual understanding and seamless integration with the audio modality. Previous VLM-based AVQA methods merely used CLIP as a feature encoder, underutilizing its knowledge, and, like most AVQA methods, treated audio and video as separate entities in a dual-stream framework. This paper proposes a new CLIP-powered target-aware single-stream (TASS) network for AVQA that exploits the pretrained model's image-text matching knowledge by way of the audio-visual matching that occurs naturally. It consists of two key components: the target-aware spatial grounding module (TSG+) and…

Taxonomy

Topics: Music and Audio Processing · Speech and dialogue systems · Speech Recognition and Synthesis

Methods: Contrastive Language-Image Pre-training