DeepSeek-VL: Towards Real-World Vision-Language Understanding
Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, Chong Ruan

TL;DR
DeepSeek-VL is a versatile open-source vision-language model designed for real-world applications, emphasizing diverse data, efficient high-resolution processing, and strong language capabilities, achieving state-of-the-art performance.
Contribution
The paper introduces DeepSeek-VL, a new vision-language model with a hybrid encoder, comprehensive real-world data, and an instruction tuning dataset, enhancing practical usability and performance.
Findings
Achieves state-of-the-art performance on visual-language benchmarks.
Maintains strong language abilities during pretraining.
Provides publicly accessible models for community use.
Abstract
We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions: We strive to ensure our data is diverse, scalable, and extensively covers real-world scenarios including web screenshots, PDFs, OCR, charts, and knowledge-based content, aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction tuning dataset accordingly. The fine-tuning with this dataset substantially improves the model's user experience in practical applications. Considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024), while maintaining a relatively low computational…
Code & Models
- 🤗 deepseek-ai/deepseek-vl-7b-chat (model) · 7.7k dl · ♡ 270
- 🤗 deepseek-ai/deepseek-vl-1.3b-chat (model) · 7.1k dl · ♡ 70
- 🤗 deepseek-ai/deepseek-vl-1.3b-base (model) · 348 dl · ♡ 56
- 🤗 deepseek-ai/deepseek-vl-7b-base (model) · 73 dl · ♡ 64
- 🤗 deepseek-community/deepseek-vl-1.3b-chat (model) · 1.8k dl · ♡ 2
- 🤗 deepseek-community/deepseek-vl-1.3b-base (model) · 18 dl
- 🤗 deepseek-community/deepseek-vl-7b-chat (model) · 1.8k dl · ♡ 1
- 🤗 deepseek-community/deepseek-vl-7b-base (model) · 15 dl
- 🤗 i99om/jointomybabaswy (model)
- 🤗 Simt123/deepseek-vl-7b-chat (model) · 17 dl
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
