Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models
Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, Subhashree Radhakrishnan, Nadine Chang, Karan Sapra, Amala Sanjay Deshmukh, Tuomas Rintamaki, Matthieu Le, Ilia Karmanov, Lukas Voegtle, Philipp Fischer, De-An Huang

TL;DR
Eagle 2 demonstrates that carefully designed post-training data strategies can significantly enhance open-source vision-language models, achieving state-of-the-art results comparable to larger proprietary models.
Contribution
This work introduces a novel data-centric post-training strategy for VLMs, providing detailed insights and recipes to develop competitive open-source models from scratch.
Findings
Eagle2-9B achieves state-of-the-art results on multiple benchmarks.
The data strategy enables smaller models to match larger proprietary models.
The detailed account of the development process is intended to benefit the open-source VLM community.
Abstract
Recently, promising progress has been made by open-source vision-language models (VLMs) in bringing their capabilities closer to those of proprietary frontier models. However, most open-source models only publish their final model weights, leaving the critical details of data strategies and implementation largely opaque. In this work, we address VLM post-training from a data-centric perspective, showing the key role of data strategy in developing frontier VLMs. By studying and building our post-training data strategy from scratch, we share detailed insights into the development processes, aiming to benefit the development of competitive models for the open-source community. Our introduced data strategy, together with training recipes and model design, leads to a family of performant VLMs named Eagle2. Specifically, Eagle2-9B achieves state-of-the-art results across various multimodal…
Code & Models
- 🤗 nvidia/llama-nemotron-embed-vl-1b-v2 · model · 42k dl · ♡ 50
- 🤗 nvidia/Eagle2-1B · model · 222 dl · ♡ 27
- 🤗 nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 · model · 90k dl · ♡ 81
- 🤗 nvidia/llama-nemotron-rerank-vl-1b-v2 · model · 47k dl · ♡ 26
- 🤗 nvidia/Eagle2-9B · model · 85 dl · ♡ 62
- 🤗 nvidia/Eagle2-2B · model · 155 dl · ♡ 32
- 🤗 nvidia/GR00T-N1-2B · model · 296 dl · ♡ 350
- 🤗 di-zhang-fdu/eagle2-9B-forked · model
- 🤗 nvidia/GR00T-N1.5-3B · model · 4.2k dl · ♡ 187
- 🤗 nvidia/llama-nemoretriever-colembed-1b-v1 · model · 95 dl · ♡ 26
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
