Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models

Rahul Thapa; Kezhen Chen; Ian Covert; Rahul Chalamala; Ben Athiwaratkun; Shuaiwen Leon Song; James Zou

arXiv:2406.00977·cs.CV·October 16, 2024·1 cites

Open Access · 1 Repo · 5 Models

TL;DR

Dragonfly introduces multi-resolution zoom-in encoding for vision-language models, significantly improving fine-grained image understanding and outperforming larger models across various benchmarks.

Contribution

The paper proposes a novel multi-resolution zoom-in technique for ViTs, enhancing fine-grained detail capture and achieving state-of-the-art results in general and medical vision-language tasks.

Findings

01. Dragonfly outperforms larger models on ten benchmarks.

02. Achieves 91.6% accuracy on the SLAKE medical dataset.

03. Sets a new state of the art in image captioning tasks.

Abstract

Recent advances in vision-language models (VLMs) have demonstrated the advantages of processing images at higher resolutions and utilizing multi-crop features to preserve native resolution details. However, despite these improvements, existing vision transformers (ViTs) still struggle to capture fine-grained details from less prominent objects, charts, and embedded text, limiting their effectiveness in certain tasks. In this paper, we extend recent high-resolution and multi-crop techniques by not only preserving the native resolution, but zooming in beyond it and extracting features from a large number of image sub-crops. This enhancement allows our model to better capture fine-grained details, overcoming the limitations of current ViTs. To manage the increased token count and computational complexity, we demonstrate that a simple mean-pooling aggregation over tokens is effective. Our…
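The two ideas named in the abstract, tiling an image into sub-crops and mean-pooling the resulting tokens, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `crop_boxes` and `mean_pool_tokens`, the even grid tiling, and the choice of pooling non-overlapping groups of consecutive tokens are all assumptions made for illustration.

```python
from statistics import fmean


def crop_boxes(height, width, rows, cols):
    """Tile an image into rows x cols sub-crops (hypothetical helper).

    Returns (top, left, bottom, right) boxes in row-major order.
    """
    if height % rows or width % cols:
        raise ValueError("image size must be divisible by the grid")
    h, w = height // rows, width // cols
    return [
        (r * h, c * w, (r + 1) * h, (c + 1) * w)
        for r in range(rows)
        for c in range(cols)
    ]


def mean_pool_tokens(tokens, window):
    """Average non-overlapping groups of `window` token vectors.

    The grouping scheme is an assumption; the abstract only states that
    mean-pooling over tokens is used to manage the token count.
    """
    if window <= 0 or len(tokens) % window:
        raise ValueError("token count must be a positive multiple of window")
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        # zip(*group) iterates over feature dimensions across the group.
        pooled.append([fmean(dim) for dim in zip(*group)])
    return pooled


boxes = crop_boxes(4, 4, 2, 2)
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
pooled = mean_pool_tokens(tokens, 2)
```

Here four tokens are reduced to two, which shows how pooling offsets the token growth that comes from encoding many sub-crops.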

Code & Models

Repositories

togethercomputer/dragonfly (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques