BLIP-Adapter: Parameter-Efficient Transfer Learning for Mobile Screenshot Captioning
Ching-Yu Chiang, I-Hua Chang, Shih-Wei Liao

TL;DR
This paper introduces a parameter-efficient method using adapters for mobile screenshot captioning, enabling effective model tuning with fewer resources while maintaining performance.
Contribution
This work is the first to investigate combining adapter methods in vision-language models specifically for screenshot captioning, which reduces the resources needed for tuning.
Findings
Adapter methods achieve comparable performance to full fine-tuning.
Freezing model parameters and tuning only adapters reduces computational costs.
The approach enhances efficiency in mobile screenshot captioning tasks.
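To give a rough sense of why tuning only adapters is cheaper, the sketch below counts parameters for one standard transformer layer and for one bottleneck adapter (down-projection, nonlinearity, up-projection). It is an illustration under stated assumptions, not the paper's configuration: the hidden size of 768, the bottleneck width of 64, the one-adapter-per-layer placement, and all function names are hypothetical, and this summary does not say which adapter variants the authors combine.

```python
def transformer_layer_params(d_model: int, ffn_mult: int = 4) -> int:
    """Parameter count of one standard transformer encoder layer (weights + biases)."""
    # Self-attention: query, key, value, and output projections, each d_model x d_model.
    attn = 4 * (d_model * d_model + d_model)
    # Feed-forward block: d_model -> d_ff -> d_model.
    d_ff = ffn_mult * d_model
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms, each with a scale and a shift vector.
    norms = 2 * (2 * d_model)
    return attn + ffn + norms


def bottleneck_adapter_params(d_model: int, bottleneck: int) -> int:
    """Parameter count of one bottleneck adapter (assumed design, not confirmed by the source)."""
    down = d_model * bottleneck + bottleneck  # down-projection weight + bias
    up = bottleneck * d_model + d_model       # up-projection weight + bias
    return down + up


def trainable_fraction(d_model: int, bottleneck: int, adapters_per_layer: int = 1) -> float:
    """Share of a layer's parameters that receive gradients when the backbone is frozen."""
    frozen = transformer_layer_params(d_model)
    trainable = adapters_per_layer * bottleneck_adapter_params(d_model, bottleneck)
    return trainable / (frozen + trainable)


if __name__ == "__main__":
    # Hypothetical sizes chosen for illustration only.
    d_model, bottleneck = 768, 64
    print("frozen params per layer: ", transformer_layer_params(d_model))
    print("adapter params per layer:", bottleneck_adapter_params(d_model, bottleneck))
    print(f"trainable fraction:       {trainable_fraction(d_model, bottleneck):.4f}")
```

Under these assumed sizes the adapter adds only a small fraction of the layer's parameters, so the gradients, optimizer state, and per-task checkpoints that must be kept shrink accordingly. The forward pass still runs through the full frozen model.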
Abstract
This study aims to explore efficient tuning methods for the screenshot captioning task. Recently, image captioning has seen significant advancements, but research in captioning tasks for mobile screens remains relatively scarce. Current datasets and use cases describing user behaviors within product screenshots are notably limited. Consequently, we sought to fine-tune pre-existing models for the screenshot captioning task. However, fine-tuning large pre-trained models can be resource-intensive, requiring considerable time, computational power, and storage due to the vast number of parameters in image captioning models. To tackle this challenge, this study proposes a combination of adapter methods, which necessitates tuning only the additional modules on the model. These methods are originally designed for vision or language tasks, and our intention is to apply them to address similar…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Adapter
