DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang

TL;DR
DeepSeek-VL2 introduces advanced mixture-of-experts vision-language models with dynamic high-resolution image encoding and efficient multi-head latent attention, achieving state-of-the-art multimodal understanding across various tasks.
Contribution
The paper presents a new series of MoE vision-language models that pair a dynamic tiling strategy for high-resolution image encoding with Multi-head Latent Attention in the language model, improving efficiency and performance over the predecessor, DeepSeek-VL.
Findings
Superior performance on visual question answering and OCR tasks
Achieves state-of-the-art results with fewer parameters
Models are publicly available for research use
Abstract
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two key major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses Key-Value cache into latent vectors, to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny,…
Code & Models
- 🤗 deepseek-ai/deepseek-vl2 · model · 3.6k dl · ♡ 379
- 🤗 deepseek-ai/deepseek-vl2-tiny · model · 415k dl · ♡ 245
- 🤗 deepseek-ai/deepseek-vl2-small · model · 12k dl · ♡ 177
- 🤗 wootsi/deepseek-vl2-tiny-automap · model
- 🤗 prince-canuma/deepseek-vl2-small · model · 6 dl · ♡ 1
- 🤗 prince-canuma/deepseek-vl2 · model · 7 dl · ♡ 2
- 🤗 prince-canuma/deepseek-vl2-tiny · model · 1 dl
- 🤗 Isotr0py/deepseek-vl2-tiny · model · 33k dl
- 🤗 matsudatkm/deepseek-vl2-tiny-clone · model · 1 dl
- 🤗 Emova-ollm/deepseek-vl2-deepseekmoe-tiny · model · 4 dl
Taxonomy
Methods: Softmax · Attention Is All You Need
