Distraction-free Embeddings for Robust VQA
Atharvan Dogra, Deeksha Varshney, Ashwin Kalyan, Ameet Deshpande, Neeraj Kumar

TL;DR
This paper introduces DRAX, a distraction removal method for cross-modal embeddings in Video Question Answering (VQA). It increases focus on task-relevant information in the latent space and keeps embeddings semantically aligned during cross-modal fusion.
Contribution
The paper proposes DRAX, a new distraction removal technique that enhances latent representations for VQA by focusing on relevant information and ensuring semantic alignment during cross-modal fusion.
Findings
Improved performance on the SUTD-TrafficQA benchmark.
Enhanced focus on task-relevant information in embeddings.
Better handling of temporal and causal reasoning tasks.
Abstract
The generation of effective latent representations and their subsequent refinement to incorporate precise information is an essential prerequisite for Vision-Language Understanding (VLU) tasks such as Video Question Answering (VQA). However, most existing methods for VLU focus on sparsely sampling or fine-graining the input information (e.g., sampling a sparse set of frames or text tokens), or adding external knowledge. We present a novel "DRAX: Distraction Removal and Attended Cross-Alignment" method to rid our cross-modal representations of distractors in the latent space. We do not exclusively confine the perception of any input information from various modalities but instead use an attention-guided distraction removal method to increase focus on task-relevant information in latent embeddings. DRAX also ensures semantic alignment of embeddings during cross-modal fusions. We evaluate…
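To make the idea of attention-guided distraction removal concrete, here is a minimal sketch. It is not the DRAX implementation, which this summary does not describe in detail. The function name `remove_distractors`, the uniform-attention threshold, and the soft gating rule are all assumptions chosen for illustration. The sketch scores each latent token against a question embedding and attenuates weakly attended tokens instead of discarding them, in the spirit of not confining the perception of any input.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def remove_distractors(query, tokens):
    """Hypothetical soft distraction removal (an assumption, not DRAX itself).

    query:  question embedding, a list of floats
    tokens: latent video/text token embeddings, a list of lists of floats

    Returns (weights, gated_tokens). Tokens attended at or above the
    uniform level 1/n pass through unchanged; tokens below it are scaled
    down in proportion to their attention weight, never removed outright.
    """
    scale = math.sqrt(len(query))
    # Scaled dot-product attention of the question over the tokens.
    weights = softmax([dot(query, t) / scale for t in tokens])
    n = len(tokens)
    gated = []
    for w, t in zip(weights, tokens):
        gate = min(1.0, w * n)  # illustrative gating rule, assumed here
        gated.append([gate * x for x in t])
    return weights, gated
```

Calling `remove_distractors([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]])` leaves the first token, which aligns with the query, untouched and shrinks the other two, with the opposing token shrunk the most. A learned gate or a hard top-k selection would be equally plausible design choices; the soft gate is used here only because it keeps every input partially visible.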
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Focus
