Integrating Query-aware Segmentation and Cross-Attention for Robust VQA

Wonjun Choi; Sangbeom Lee; Seungyeon Lee; Heechul Jung; Dong-Gyu Lee

arXiv:2407.12055·cs.CV·July 18, 2024

TL;DR

This paper presents a novel approach for Visual Question Answering that combines query-aware segmentation, cross-attention, and ensemble techniques to improve robustness and accuracy on VizWiz-VQA tasks.

Contribution

The paper introduces a method that integrates query-aware segmentation and cross-attention with ensemble strategies, using a large vision-language model (LVLM), CLIPSeg, and ViT features to improve VQA performance.

Findings

01 Improved accuracy on the VizWiz-VQA dataset.

02 Effective use of CLIPSeg for image enhancement.

03 An ensemble based on Levenshtein distance boosts prediction quality.

Abstract

This paper introduces a method for VizWiz-VQA using an LVLM with trainable cross-attention and LoRA fine-tuning. We train the model under three conditions: 1) training with the original images; 2) training with images enhanced using CLIPSeg to highlight or contrast the original image; 3) training with the output features of a Vision Transformer (ViT) integrated with CLIPSeg features of the original images. We then ensemble the results based on Levenshtein distance to improve the prediction of the final answer. In the experiments, we demonstrate and analyze the effectiveness of the proposed method.
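
The abstract does not spell out how the Levenshtein-based ensemble selects the final answer. The sketch below shows one plausible reading, not the authors' implementation: given the answer strings produced by the separately trained models, it picks the one with the smallest total edit distance to all the others (a medoid). The function names `levenshtein` and `ensemble_answer`, and the lowercase/strip normalization, are assumptions made for illustration.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ensemble_answer(predictions: List[str]) -> str:
    """Hypothetical selection rule: return the prediction closest to all others.

    Assumption: answers are compared after stripping whitespace and
    lowercasing; ties go to the earliest prediction in the list.
    """
    if not predictions:
        raise ValueError("predictions must not be empty")
    normalized = [p.strip().lower() for p in predictions]
    totals = [sum(levenshtein(p, q) for q in normalized) for p in normalized]
    return predictions[totals.index(min(totals))]
```

For example, `ensemble_answer(["dog", "dog", "cat"])` returns `"dog"`, since the two agreeing answers have the lowest total distance. Other rules, such as weighting each model by validation accuracy, would fit the same description equally well.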

Taxonomy

Topics: Industrial Vision Systems and Defect Detection