Using Visual Cropping to Enhance Fine-Detail Question Answering of BLIP-Family Models
Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski

TL;DR
This paper investigates how visual cropping, both human and automatic, can improve the performance of BLIP-family models on fine-detail visual question answering, especially in zero-shot settings, by focusing on relevant image regions.
Contribution
The study introduces automatic cropping strategies based on multi-modal embeddings and demonstrates their effectiveness in enhancing BLIP models' accuracy on fine-detail questions.
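As a rough illustration of embedding-guided cropping, the sketch below scores sliding-window crops against a question embedding and keeps the best match. It is a minimal sketch under stated assumptions, not the authors' implementation: `select_crop`, `sliding_windows`, and `toy_embed_image` are hypothetical names, and the toy embedder (mean color per channel) merely stands in for a real multi-modal encoder such as a CLIP-style model.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def sliding_windows(height, width, win, stride):
    """Yield (top, left, bottom, right) boxes that tile the image."""
    for top in range(0, max(height - win, 0) + 1, stride):
        for left in range(0, max(width - win, 0) + 1, stride):
            yield (top, left, min(top + win, height), min(left + win, width))


def select_crop(image, question_vec, embed_image, win, stride):
    """Return the window whose embedding is most similar to the question.

    `embed_image` is an assumed callable mapping an image patch to a vector
    in the same space as `question_vec` (e.g. a shared image-text space).
    """
    best_box, best_score = None, -np.inf
    for box in sliding_windows(image.shape[0], image.shape[1], win, stride):
        top, left, bottom, right = box
        score = cosine(embed_image(image[top:bottom, left:right]), question_vec)
        if score > best_score:
            best_box, best_score = box, score
    return best_box, best_score


def toy_embed_image(patch):
    """Hypothetical stand-in encoder: mean value of each color channel."""
    return patch.reshape(-1, patch.shape[-1]).mean(axis=0)


# Toy data: a black 8x8 RGB image with a red square in the bottom-right corner.
image = np.zeros((8, 8, 3))
image[4:8, 4:8, 0] = 1.0

# Pretend the question ("what is the red object?") embeds to the red direction.
question_vec = np.array([1.0, 0.0, 0.0])

box, score = select_crop(image, question_vec, toy_embed_image, win=4, stride=4)
print(box, round(score, 3))
```

In a real pipeline, the selected box would be cropped from the original image and passed to the VQA model together with the question. The window size and stride are free parameters here, chosen only to keep the toy example small.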
Findings
Cropping to the question-relevant region significantly improves BLIP-family model performance on fine-detail questions.
Automatic cropping methods are comparable to human cropping.
Performance gains are more notable in zero-shot models and with smaller bounding boxes.
Abstract
Visual Question Answering is a challenging task, as it requires seamless interaction between perceptual, linguistic, and background knowledge systems. While the recent progress of visual and natural language models like BLIP has led to improved performance on this task, we lack understanding of the ability of such models to perform on different kinds of questions and reasoning types. As our initial analysis of BLIP-family models revealed difficulty with answering fine-detail questions, we investigate the following question: Can visual cropping be employed to improve the performance of state-of-the-art visual question answering models on fine-detail questions? Given the recent success of the BLIP-family models, we study a zero-shot and a fine-tuned BLIP model. We define three controlled subsets of the popular VQA-v2 benchmark to measure whether cropping can help model performance.…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Language-Image Pre-training · BLIP: Bootstrapping Language-Image Pre-training
