UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding
Zhecan Wang, Rui Sun, Haoxuan You, Noel Codella, Kai-Wei Chang, Shih-Fu Chang

TL;DR
UniFine introduces a unified framework leveraging fine-grained visual and textual information to enhance zero-shot performance across multiple vision-language tasks like VQA, SNLI-VE, and VCR, surpassing previous methods.
Contribution
The paper presents a novel approach that incorporates fine-grained details for zero-shot vision-language understanding, improving performance over global matching strategies.
Findings
Outperforms previous zero-shot methods on VQA.
Achieves substantial improvements on SNLI-VE and VCR.
Ablation studies confirm effectiveness and generalizability.
Abstract
Vision-language tasks, such as VQA, SNLI-VE, and VCR, are challenging because they require a model to reason about the semantics of both the visual world and natural language. Supervised methods for vision-language tasks have been well studied. However, solving these tasks in a zero-shot setting is less explored. Since Contrastive Language-Image Pre-training (CLIP) has shown remarkable zero-shot performance on image-text matching, previous works exploited its strong zero-shot ability by converting vision-language tasks into an image-text matching problem, and they mainly considered global-level matching (e.g., the whole image or sentence). However, we find that fine-grained visual and textual information, e.g., keywords in the sentence and objects in the image, can be fairly informative for semantic understanding. Inspired by this, we propose a unified framework to take…
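The idea of casting a vision-language task as image-text matching, and of adding fine-grained cues to a global score, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy 2-D vectors stand in for CLIP embeddings, and the function names, the max-then-mean aggregation, and the `weight` parameter are hypothetical rather than taken from the paper, whose actual formulation is not shown in the truncated abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def fine_grained_score(region_vecs, keyword_vecs):
    """For each keyword, take its best-matching image region, then average.
    (Hypothetical aggregation, chosen only for illustration.)"""
    if not region_vecs or not keyword_vecs:
        return 0.0
    best = [max(cosine(k, r) for r in region_vecs) for k in keyword_vecs]
    return sum(best) / len(best)

def pick_answer(image_vec, region_vecs, candidates, weight=0.5):
    """Score each candidate text by global similarity plus a weighted
    fine-grained term, and return the best candidate and all scores."""
    scores = {}
    for name, (text_vec, keyword_vecs) in candidates.items():
        global_part = cosine(image_vec, text_vec)
        fine_part = fine_grained_score(region_vecs, keyword_vecs)
        scores[name] = global_part + weight * fine_part
    return max(scores, key=scores.get), scores

# Toy embeddings (assumed values, not real CLIP outputs).
image_vec = [1.0, 0.0]        # whole-image embedding
region_vecs = [[0.0, 1.0]]    # one detected-object embedding
candidates = {
    # name: (whole-sentence embedding, keyword embeddings)
    "a": ([1.0, 0.0], [[1.0, 0.0]]),
    "b": ([0.8, 0.6], [[0.0, 1.0]]),
}

global_only, _ = pick_answer(image_vec, region_vecs, candidates, weight=0.0)
with_fine, _ = pick_answer(image_vec, region_vecs, candidates, weight=0.5)
print(global_only, with_fine)
```

In this toy setup, global matching alone prefers candidate "a", while adding the fine-grained term switches the choice to "b", whose keyword matches the object region. That is the kind of effect the abstract attributes to fine-grained information, shown here only on made-up numbers.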
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
