ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data
Maya Varma, Jean-Benoit Delbrouck, Sarah Hooper, Akshay Chaudhari, Curtis Langlotz

TL;DR
This paper introduces ViLLA, an approach for training vision-language models on complex, real-world data with fine-grained region-attribute relationships, improving performance on fine-grained reasoning tasks.
Contribution
ViLLA is a new framework that decomposes complex image-text pairs into region-attribute pairs and learns from these, addressing the limitations of standard VLMs on complex datasets.
Findings
ViLLA outperforms existing VLMs on fine-grained reasoning tasks.
Standard VLMs struggle as the pairwise complexity of the training data increases, showing performance degradation of up to 37% on retrieval tasks.
ViLLA improves over comparable VLMs by up to 3.6 AP50 points in zero-shot object detection and up to 14.2 R-Precision points in retrieval.
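For readers unfamiliar with the retrieval metric named above, the following is a minimal sketch of R-Precision under its standard definition: the fraction of the top R ranked items that are relevant, where R is the number of relevant items for the query. The function name, identifiers, and toy ranking are illustrative assumptions; the paper's exact evaluation protocol is not given in this summary.

```python
def r_precision(ranked_ids, relevant_ids):
    """Standard R-Precision: precision within the top R results,
    where R is the number of relevant items for the query."""
    relevant = set(relevant_ids)
    r = len(relevant)
    if r == 0:
        raise ValueError("R-Precision is undefined when there are no relevant items")
    top_r = ranked_ids[:r]
    hits = sum(1 for item in top_r if item in relevant)
    return hits / r


# Toy example (hypothetical IDs): 3 relevant texts, 2 of them ranked in the top 3.
ranked = ["t3", "t7", "t1", "t9", "t2"]
relevant = ["t3", "t1", "t2"]
score = r_precision(ranked, relevant)
print(f"R-Precision: {score:.3f}")
```

Reported gains in "R-Precision points" are presumably differences in this score expressed on a 0 to 100 scale.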
Abstract
Vision-language models (VLMs), such as CLIP and ALIGN, are generally trained on datasets consisting of image-caption pairs obtained from the web. However, real-world multimodal datasets, such as healthcare data, are significantly more complex: each image (e.g. X-ray) is often paired with text (e.g. physician report) that describes many distinct attributes occurring in fine-grained regions of the image. We refer to these samples as exhibiting high pairwise complexity, since each image-text pair can be decomposed into a large number of region-attribute pairings. The extent to which VLMs can capture fine-grained relationships between image regions and textual attributes when trained on such data has not been previously evaluated. The first key contribution of this work is to demonstrate through systematic evaluations that as the pairwise complexity of the training dataset increases,…
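The notion of pairwise complexity described in the abstract can be illustrated with a toy sketch: each image-text sample is represented as a set of region-attribute pairs, and its complexity is the number of distinct pairs. The class name, function name, and placeholder labels below are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionAttributePair:
    region: str     # hypothetical region identifier (e.g. a bounding-box name)
    attribute: str  # hypothetical textual attribute describing that region


def pairwise_complexity(pairs):
    """Number of distinct region-attribute pairs in one image-text sample."""
    return len(set(pairs))


# A simple captioned image: one region, one attribute.
low_complexity_sample = [RegionAttributePair("region_0", "attribute_a")]

# A report-style sample: several attributes spread over several regions.
high_complexity_sample = [
    RegionAttributePair("region_0", "attribute_a"),
    RegionAttributePair("region_1", "attribute_b"),
    RegionAttributePair("region_2", "attribute_c"),
]

print(pairwise_complexity(low_complexity_sample))
print(pairwise_complexity(high_complexity_sample))
```

Under this toy representation, a caption describing a single salient object has low complexity, while a report covering many findings in many regions has high complexity.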
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Methods: ALIGN · Contrastive Language-Image Pre-training
