ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data

Maya Varma; Jean-Benoit Delbrouck; Sarah Hooper; Akshay Chaudhari; Curtis Langlotz

arXiv:2308.11194·cs.CV·August 23, 2023

TL;DR

This paper introduces ViLLA, an approach for training vision-language models on complex real-world data in which each image-text pair contains many fine-grained region-attribute relationships, and shows that it improves performance on fine-grained reasoning tasks.

Contribution

ViLLA is a new framework that decomposes complex image-text pairs into region-attribute pairs and learns from these, addressing the limitations of standard VLMs on complex datasets.
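To make the idea of decomposition concrete, here is a minimal sketch of how one image-text sample could be flattened into region-attribute pairs. It assumes the attribute-to-region mapping is already given, which real-world data typically does not provide, and it uses the pair count as a simple proxy for pairwise complexity. The names `RegionAttributePair`, `decompose`, and the `(x, y, width, height)` box format are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical box format: (x, y, width, height) in pixels.
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class RegionAttributePair:
    """One fine-grained pairing of an image region with a textual attribute."""
    region: Box
    attribute: str


def decompose(attribute_to_regions: Dict[str, List[Box]]) -> List[RegionAttributePair]:
    """Flatten one image-text sample into region-attribute pairs.

    Assumes the mapping from attributes to regions is known in advance.
    """
    pairs = []
    for attribute, regions in attribute_to_regions.items():
        for region in regions:
            pairs.append(RegionAttributePair(region=region, attribute=attribute))
    return pairs


# Toy chest X-ray sample: two attributes, one of which occurs in two regions.
sample = {
    "cardiomegaly": [(120, 200, 180, 160)],
    "pleural effusion": [(40, 300, 90, 110), (310, 290, 95, 120)],
}

pairs = decompose(sample)
# Illustrative proxy only: the number of region-attribute pairings in the sample.
pairwise_complexity = len(pairs)
print(pairwise_complexity)
```

A web-style image-caption pair would usually yield a single pairing under this scheme, whereas a physician report describing many findings yields many, which is the contrast the summary draws.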

Findings

01

ViLLA outperforms existing VLMs on fine-grained reasoning tasks.

02

Standard VLMs struggle to learn region-attribute relationships as pairwise complexity increases, with performance degrading by up to 37% on retrieval tasks.

03

ViLLA outperforms comparable VLMs by up to 3.6 AP50 points in zero-shot object detection and by up to 14.2 R-Precision points in retrieval.
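For readers unfamiliar with the retrieval metric named in the findings, the sketch below computes R-Precision under its standard definition: for a query with R relevant items, it is the fraction of the top R retrieved items that are relevant. The function name `r_precision` and the toy data are assumptions for illustration, and the paper's exact evaluation protocol may differ.

```python
from typing import Sequence, Set


def r_precision(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Fraction of the top-R ranked items that are relevant, where R = number of relevant items."""
    r = len(relevant_ids)
    if r == 0:
        # No relevant items: the metric is undefined, so return 0.0 by convention here.
        return 0.0
    top_r = ranked_ids[:r]
    hits = sum(1 for item in top_r if item in relevant_ids)
    return hits / r


# Toy query: three relevant items, so only the top three ranked results are inspected.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "e"}
score = r_precision(ranked, relevant)  # top three are a, b, c: two hits out of three
print(round(score, 3))
```

Because the cutoff adapts to the number of relevant items per query, the metric stays comparable across queries with different numbers of correct matches.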

Abstract

Vision-language models (VLMs), such as CLIP and ALIGN, are generally trained on datasets consisting of image-caption pairs obtained from the web. However, real-world multimodal datasets, such as healthcare data, are significantly more complex: each image (e.g. X-ray) is often paired with text (e.g. physician report) that describes many distinct attributes occurring in fine-grained regions of the image. We refer to these samples as exhibiting high pairwise complexity, since each image-text pair can be decomposed into a large number of region-attribute pairings. The extent to which VLMs can capture fine-grained relationships between image regions and textual attributes when trained on such data has not been previously evaluated. The first key contribution of this work is to demonstrate through systematic evaluations that as the pairwise complexity of the training dataset increases,…

Code & Models

Repositories

stanfordmimi/villa
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: ALIGN · Contrastive Language-Image Pre-training