UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

Zhecan Wang; Rui Sun; Haoxuan You; Noel Codella; Kai-Wei Chang; Shih-Fu Chang

arXiv:2307.00862 · cs.CV · April 1, 2025 · 1 citation

Open Access · 1 Repo

TL;DR

UniFine introduces a unified framework that uses fine-grained visual and textual information to improve zero-shot performance on multiple vision-language tasks (VQA, SNLI-VE, and VCR), surpassing previous zero-shot methods.

Contribution

The paper presents an approach that incorporates fine-grained details, such as keywords in the sentence and objects in the image, into zero-shot vision-language understanding, improving performance over strategies that rely only on global matching.

Findings

01. Outperforms previous zero-shot methods on VQA.

02. Achieves substantial improvements on SNLI-VE and VCR.

03. Ablation studies confirm effectiveness and generalizability.

Abstract

Vision-language tasks, such as VQA, SNLI-VE, and VCR, are challenging because they require a model to reason about the semantics of both the visual world and natural language. Supervised methods for vision-language tasks have been well studied. However, solving these tasks in a zero-shot setting is less explored. Since Contrastive Language-Image Pre-training (CLIP) has shown remarkable zero-shot performance on image-text matching, previous works utilized its strong zero-shot ability by converting vision-language tasks into an image-text matching problem, and they mainly consider global-level matching (e.g., the whole image or sentence). However, we find that fine-grained visual and textual information, e.g., keywords in the sentence and objects in the image, can be fairly informative for semantic understanding. Inspired by this, we propose a unified framework to take…
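
As a rough illustration of the idea described in the abstract, the Python sketch below casts answer selection as image-text matching and blends a global score with a fine-grained one. It is not the paper's implementation: the toy 2-D vectors stand in for CLIP embeddings, and the function names (`cosine`, `fine_grained_score`, `pick_answer`), the max-over-regions matching rule, and the mixing weight `alpha` are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fine_grained_score(keyword_vecs, region_vecs):
    """Average, over keywords, of the similarity to the best-matching image region.

    Assumption: this max-then-mean rule is one plausible way to match
    keywords to objects; the source does not specify the rule.
    """
    if not keyword_vecs or not region_vecs:
        return 0.0
    best = [max(cosine(k, r) for r in region_vecs) for k in keyword_vecs]
    return sum(best) / len(best)


def pick_answer(image_vec, region_vecs, candidates, alpha=0.5):
    """Pick the candidate whose text best matches the image.

    candidates maps an answer string to (sentence_vec, keyword_vecs).
    alpha (hypothetical) weights the fine-grained score against the global one.
    """
    scores = {}
    for answer, (sentence_vec, keyword_vecs) in candidates.items():
        global_score = cosine(image_vec, sentence_vec)  # whole image vs. whole sentence
        local_score = fine_grained_score(keyword_vecs, region_vecs)  # keywords vs. objects
        scores[answer] = (1 - alpha) * global_score + alpha * local_score
    return max(scores, key=scores.get), scores


# Toy stand-ins for embeddings (assumed values, not real CLIP outputs).
image_vec = [1.0, 0.0]
region_vecs = [[0.0, 1.0]]
candidates = {
    "cat": ([1.0, 0.2], [[1.0, 0.0]]),
    "dog": ([1.0, 0.3], [[0.0, 1.0]]),
}

global_only, _ = pick_answer(image_vec, region_vecs, candidates, alpha=0.0)
blended, _ = pick_answer(image_vec, region_vecs, candidates, alpha=0.5)
print(global_only, blended)
```

In this toy setup the two candidates are nearly tied on the global score, so the fine-grained term decides the outcome once `alpha` is non-zero. That is the kind of case where keyword and object evidence could plausibly help.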

Code & Models

Repositories

threesr/unifine
Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques