Declaration-based Prompt Tuning for Visual Question Answering

Yuhang Liu; Wei Wei; Daowan Peng; Feida Zhu

arXiv:2205.02456 · cs.CV · May 6, 2022 · 5 cites

TL;DR

This paper introduces Declaration-based Prompt Tuning (DPT), a novel approach that unifies pre-training and fine-tuning objectives for VQA, significantly improving accuracy and generalization, especially in low-data scenarios.

Contribution

DPT reformulates VQA as a prompt-tuning task with joint optimization of pre-training and fine-tuning objectives, enhancing model adaptation and performance.
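The reformulation can be illustrated with a toy sketch: a question is rewritten as a declarative sentence with a masked answer slot, and each candidate answer is scored by filling that slot. This is only an illustration under stated assumptions, not the authors' implementation. The regex rule, the function names `to_declaration` and `pick_answer`, and the `score` callable (a stand-in for a pre-trained VL model's MLM or ITM score) are all hypothetical.

```python
import re

MASK = "[MASK]"

def to_declaration(question: str) -> str:
    """Toy rule (assumption): rewrite a question as a declaration with a masked answer slot."""
    q = question.strip().rstrip("?").strip()
    # Handles only "What <attribute> is <object>" questions; a real system needs far more coverage.
    m = re.fullmatch(r"what (\w+) is (.+)", q, flags=re.IGNORECASE)
    if m:
        return f"The {m.group(1).lower()} of {m.group(2)} is {MASK}."
    # Fallback: keep the question and append a generic masked statement.
    return f"{q}? The answer is {MASK}."

def pick_answer(declaration, candidates, score):
    """Fill the masked slot with each candidate and return the highest-scoring one.

    `score` is a hypothetical callable standing in for a VL model's score of the
    filled declaration (for example, an MLM or ITM probability).
    """
    return max(candidates, key=lambda a: score(declaration.replace(MASK, a)))

declaration = to_declaration("What color is the car?")
# Dummy scorer for demonstration only: prefers sentences containing "red".
best = pick_answer(declaration, ["blue", "red"], lambda s: 1.0 if "red" in s else 0.0)
print(declaration)
print(best)
```

Because the answer is predicted by filling a masked slot or scoring a filled sentence, the downstream objective takes the same form as the pre-training objectives named in the abstract, which is the consistency the contribution statement refers to.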

Findings

01. DPT outperforms traditional fine-tuning by 2.68% in accuracy on GQA.

02. DPT achieves over 31% improvement in zero-shot and few-shot settings.

03. Experimental results demonstrate DPT's effectiveness in both fully-supervised and low-data scenarios.

Abstract

In recent years, the pre-training-then-fine-tuning paradigm has yielded immense success on a wide spectrum of cross-modal tasks, such as visual question answering (VQA), in which a visual-language (VL) model is first optimized via self-supervised task objectives, e.g., masked language modeling (MLM) and image-text matching (ITM), and then fine-tuned to adapt to a downstream task (e.g., VQA) via a brand-new objective function, e.g., answer prediction. The inconsistency between the objective forms not only severely limits the generalization of pre-trained VL models to downstream tasks, but also requires a large amount of labeled data for fine-tuning. To alleviate the problem, we propose an innovative VL fine-tuning paradigm (named Declaration-based Prompt Tuning, abbreviated as DPT), which jointly optimizes the pre-training and fine-tuning objectives of the VQA model, boosting the effective…

Code & Models

Repositories

cciiplab/dpt
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Attention Is All You Need · Linear Layer · Residual Connection · Softmax · Multi-Head Attention · Dense Connections · Layer Normalization · Convolution · Dense Prediction Transformer