Towards Efficient Visual-Language Alignment of the Q-Former for Visual Reasoning Tasks

Sungkyung Kim; Adam Lee; Junyoung Park; Andrew Chung; Jusang Oh; Jay-Yoon Lee

arXiv:2410.09489 · cs.CL · October 15, 2024

TL;DR

This paper explores parameter-efficient fine-tuning (PEFT) of the Q-Former for visual reasoning. It shows that PEFT achieves performance comparable to full fine-tuning with under 2% of the trainable parameters, and it analyzes the relative importance of the Q-Former's sublayers.

Contribution

The paper demonstrates effective PEFT of the Q-Former for visual reasoning and uses AdaLoRA to analyze the relative importance of its sublayers.
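The idea of reading sublayer importance off a reallocated parameter budget can be sketched as follows. This is a deliberately simplified illustration, not the paper's procedure or the AdaLoRA implementation: the function `allocate_budget`, the sublayer names, and every score are hypothetical, and the real method derives importance scores during training rather than taking them as fixed inputs.

```python
from collections import Counter


def allocate_budget(scores: dict[str, list[float]], budget: int) -> dict[str, int]:
    """Keep the `budget` highest-scoring rank components across all sublayers.

    Returns the number of surviving components (the effective rank) per sublayer.
    """
    # Flatten to (score, sublayer) pairs and sort from most to least important.
    ranked = sorted(
        ((score, name) for name, values in scores.items() for score in values),
        reverse=True,
    )
    kept = Counter(name for _, name in ranked[:budget])
    return {name: kept.get(name, 0) for name in scores}


# Hypothetical scores, invented for illustration; they do not reflect any reported result.
toy_scores = {
    "sublayer_a": [0.9, 0.8, 0.7, 0.2],
    "sublayer_b": [0.6, 0.3, 0.1, 0.05],
    "sublayer_c": [0.5, 0.4, 0.15, 0.1],
}
final_ranks = allocate_budget(toy_scores, budget=6)
print(final_ranks)
```

Under this sketch, a sublayer that retains more components after pruning is interpreted as more important for the task.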

Findings

01

Self-attention layers are crucial for perceptual visual-language reasoning.

02

FFN layer importance varies with task complexity.

03

PEFT achieves performance similar to full fine-tuning with under 2% of the trainable parameters.
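To give a feel for why low-rank adapters need so few trainable parameters, here is a minimal arithmetic sketch. The hidden size and rank below are illustrative assumptions, not the paper's settings, and the ratio is computed against a single adapted weight matrix rather than the whole model, so it is not the figure reported above.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one low-rank adapter: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Adapter parameters relative to the frozen d_out x d_in matrix they adapt."""
    return lora_param_count(d_in, d_out, rank) / (d_in * d_out)


# Assumed values for illustration only (hypothetical, not taken from the paper).
HIDDEN = 768
RANK = 4
fraction = trainable_fraction(HIDDEN, HIDDEN, RANK)
print(f"{fraction:.2%} of the adapted matrix is trainable")
```

For a square matrix the fraction simplifies to 2 * rank / hidden, so it grows linearly with the rank and shrinks as the layer gets wider.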

Abstract

Recent advancements in large language models have demonstrated enhanced capabilities in visual reasoning tasks by employing additional encoders for aligning different modalities. While the Q-Former has been widely used as a general encoder for aligning several modalities including image, video, audio, and 3D with large language models, previous works on its efficient training and the analysis of its individual components have been limited. In this work, we investigate the effectiveness of parameter-efficient fine-tuning (PEFT) of the Q-Former using InstructBLIP with the visual reasoning benchmarks ScienceQA and IconQA. We observe that applying PEFT to the Q-Former achieves comparable performance to full fine-tuning using under 2% of the trainable parameters. Additionally, we employ AdaLoRA for dynamic parameter budget reallocation to examine the relative importance of the Q-Former's sublayers…

Code & Models

Repositories

attentionx/instructblip_peft
PyTorch · Official

Taxonomy

Topics: Constraint Satisfaction and Optimization · Data Visualization and Analytics