Distraction-free Embeddings for Robust VQA

Atharvan Dogra; Deeksha Varshney; Ashwin Kalyan; Ameet Deshpande; Neeraj Kumar

arXiv:2309.00133·cs.CV·September 4, 2023

Open Access

TL;DR

This paper introduces DRAX, a novel distraction removal method for cross-modal embeddings in Video Question Answering (VQA). It increases focus on task-relevant information and keeps embeddings semantically aligned during cross-modal fusion.

Contribution

The paper proposes DRAX, a new distraction removal technique that enhances latent representations for VQA by focusing on relevant information and ensuring semantic alignment during cross-modal fusion.

Findings

01. Improved performance on SUTD-TrafficQA benchmark.

02. Enhanced focus on task-relevant information in embeddings.

03. Better handling of temporal and causal reasoning tasks.

Abstract

The generation of effective latent representations and their subsequent refinement to incorporate precise information is an essential prerequisite for Vision-Language Understanding (VLU) tasks such as Video Question Answering (VQA). However, most existing methods for VLU focus on sparsely sampling or fine-graining the input information (e.g., sampling a sparse set of frames or text tokens), or adding external knowledge. We present a novel "DRAX: Distraction Removal and Attended Cross-Alignment" method to rid our cross-modal representations of distractors in the latent space. We do not exclusively confine the perception of any input information from various modalities but instead use an attention-guided distraction removal method to increase focus on task-relevant information in latent embeddings. DRAX also ensures semantic alignment of embeddings during cross-modal fusions. We evaluate…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Focus