Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models
Rahul Thapa, Kezhen Chen, Ian Covert, Rahul Chalamala, Ben Athiwaratkun, Shuaiwen Leon Song, James Zou

TL;DR
Dragonfly introduces multi-resolution zoom-in encoding for vision-language models, significantly improving fine-grained image understanding and outperforming larger models across various benchmarks.
Contribution
The paper proposes a novel multi-resolution zoom-in technique for ViTs, enhancing fine-grained detail capture and achieving state-of-the-art results in general and medical vision-language tasks.
Findings
Dragonfly outperforms larger models on ten benchmarks.
Achieves 91.6% accuracy on the SLAKE medical dataset.
Sets a new state of the art in image captioning tasks.
Abstract
Recent advances in vision-language models (VLMs) have demonstrated the advantages of processing images at higher resolutions and utilizing multi-crop features to preserve native resolution details. However, despite these improvements, existing vision transformers (ViTs) still struggle to capture fine-grained details from less prominent objects, charts, and embedded text, limiting their effectiveness in certain tasks. In this paper, we extend recent high-resolution and multi-crop techniques by not only preserving the native resolution, but zooming in beyond it and extracting features from a large number of image sub-crops. This enhancement allows our model to better capture fine-grained details, overcoming the limitations of current ViTs. To manage the increased token count and computational complexity, we demonstrate that a simple mean-pooling aggregation over tokens is effective. Our…
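The pipeline the abstract describes (zoom in beyond native resolution, split into many sub-crops, encode each crop, then mean-pool the tokens) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: `zoom_in`, `split_into_crops`, `toy_encode`, `mean_pool_tokens`, and `encode_multi_crop` are hypothetical names, the nearest-neighbour upscaling and the patch-averaging "encoder" are stand-ins for whatever resizing and ViT the real model uses, and the grid, patch, and pooling sizes are arbitrary assumptions.

```python
import numpy as np


def zoom_in(image, factor):
    """Upscale an (H, W, C) image by an integer factor (nearest neighbour, assumed)."""
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)


def split_into_crops(image, grid):
    """Split an (H, W, C) image into grid x grid equal, non-overlapping crops."""
    h, w, _ = image.shape
    ch, cw = h // grid, w // grid
    return [
        image[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw]
        for r in range(grid)
        for c in range(grid)
    ]


def toy_encode(crop, patch):
    """Hypothetical stand-in for a ViT: one token per patch (per-channel mean)."""
    h, w, c = crop.shape
    gh, gw = h // patch, w // patch
    patches = crop[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return patches.mean(axis=(1, 3))  # token grid of shape (gh, gw, c)


def mean_pool_tokens(tokens, window):
    """Average each window x window block of the token grid into one token."""
    gh, gw, d = tokens.shape
    oh, ow = gh // window, gw // window
    blocks = tokens[:oh * window, :ow * window].reshape(oh, window, ow, window, d)
    return blocks.mean(axis=(1, 3)).reshape(oh * ow, d)


def encode_multi_crop(image, factor=2, grid=3, patch=4, window=2):
    """Zoom in, crop, encode each crop, mean-pool, and concatenate all tokens."""
    zoomed = zoom_in(image, factor)
    pooled = [
        mean_pool_tokens(toy_encode(crop, patch), window)
        for crop in split_into_crops(zoomed, grid)
    ]
    return np.concatenate(pooled, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((24, 24, 3))
    tokens = encode_multi_crop(img)
    print(tokens.shape)
```

With these assumed sizes, a 24x24 image is zoomed to 48x48 and split into nine 16x16 crops. Each crop yields a 4x4 token grid, and 2x2 mean pooling cuts that from 16 to 4 tokens per crop, so the sequence passed downstream is a quarter of the unpooled length.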
Code & Models
- 🤗 togethercomputer/Llama-3-8B-Dragonfly-v1 (model) · ♡ 33
- 🤗 togethercomputer/Llama-3-8B-Dragonfly-Med-v1 (model) · ♡ 22
- 🤗 SillyTilly/LLama-3-Dragonfly-Med (model) · 4 dl
- 🤗 togethercomputer/Llama-3.1-8B-Dragonfly-v2 (model) · 38 dl · ♡ 1
- 🤗 togethercomputer/Llama-3.1-8B-Dragonfly-Med-v2 (model) · 25 dl · ♡ 3
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
