Declaration-based Prompt Tuning for Visual Question Answering
Yuhang Liu, Wei Wei, Daowan Peng, Feida Zhu

TL;DR
This paper introduces Declaration-based Prompt Tuning (DPT), a novel approach that unifies pre-training and fine-tuning objectives for VQA, significantly improving accuracy and generalization, especially in low-data scenarios.
Contribution
DPT reformulates VQA as a prompt-tuning task with joint optimization of pre-training and fine-tuning objectives, enhancing model adaptation and performance.
Findings
DPT outperforms traditional fine-tuning by 2.68% in accuracy on GQA.
DPT improves accuracy over conventional fine-tuning by more than 31% in zero-shot and few-shot settings on GQA.
Experimental results demonstrate DPT's effectiveness in both fully-supervised and low-data scenarios.
Abstract
In recent years, the pre-training-then-fine-tuning paradigm has yielded immense success on a wide spectrum of cross-modal tasks, such as visual question answering (VQA), in which a visual-language (VL) model is first optimized via self-supervised task objectives, e.g., masked language modeling (MLM) and image-text matching (ITM), and then fine-tuned to adapt to a downstream task (e.g., VQA) via a brand-new objective function, e.g., answer prediction. The inconsistency of the objective forms not only severely limits the generalization of pre-trained VL models to downstream tasks, but also requires a large amount of labeled data for fine-tuning. To alleviate the problem, we propose an innovative VL fine-tuning paradigm (named Declaration-based Prompt Tuning, abbreviated as DPT), which jointly optimizes the objectives of pre-training and fine-tuning of the VQA model, boosting the effective…
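The reformulation described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the template rule, the function names (`to_declaration`, `rank_answers`), and both scoring functions are hypothetical stand-ins, assuming only that a question is rewritten as a declarative sentence with a `[MASK]` slot and that candidate answers are ranked by combining an MLM-style mask-filling score with an ITM-style matching score.

```python
from typing import Callable, List, Tuple

MASK = "[MASK]"


def to_declaration(question: str) -> str:
    """Rewrite a question as a declarative sentence with a mask slot.

    Hypothetical hand-written rule covering one question shape only;
    everything else falls back to appending an answer slot.
    """
    q = question.strip().rstrip("?").strip()
    prefix = "what color is "
    if q.lower().startswith(prefix):
        subject = q[len(prefix):]
        return f"{subject[0].upper()}{subject[1:]} is {MASK}."
    return f"{q}? The answer is {MASK}."


def rank_answers(
    declaration: str,
    candidates: List[str],
    mask_score: Callable[[str, str], float],
    match_score: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Rank candidates by mask-filling score plus sentence-matching score."""
    scored = []
    for answer in candidates:
        filled = declaration.replace(MASK, answer)
        total = mask_score(declaration, answer) + match_score(filled)
        scored.append((answer, total))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_mask_score(declaration: str, answer: str) -> float:
    # Stand-in for an MLM head: fixed scores, no real model involved.
    return {"red": 0.7, "blue": 0.2, "green": 0.1}.get(answer, 0.0)


def toy_match_score(filled: str) -> float:
    # Stand-in for an ITM head: pretends the image shows something red.
    return 0.5 if "red" in filled else 0.0


declaration = to_declaration("What color is the car?")
ranking = rank_answers(
    declaration, ["blue", "red", "green"], toy_mask_score, toy_match_score
)
print(declaration)
print(ranking[0][0])
```

In a real system the two toy scorers would be replaced by the pre-trained model's own heads, which is the point of keeping the downstream task in the same form as the pre-training objectives.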
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Linear Layer · Residual Connection · Softmax · Multi-Head Attention · Dense Connections · Layer Normalization · Convolution · Dense Prediction Transformer
