DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu; Wen Liu; Bo Zhang; Bingxuan Wang; Kai Dong; Bo Liu; Jingxiang Sun; Tongzheng Ren; Zhuoshu Li; Hao Yang; Yaofeng Sun; Chengqi Deng; Hanwei Xu; Zhenda Xie; Chong Ruan

arXiv:2403.05525 · cs.AI · March 12, 2024 · 45 citations

TL;DR

DeepSeek-VL is an open-source vision-language model designed for real-world applications. It emphasizes diverse training data, efficient high-resolution image processing, and strong language capabilities, and it reports state-of-the-art or competitive results on vision-language benchmarks among models of the same size.

Contribution

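The paper introduces DeepSeek-VL, a vision-language model that combines three elements: a hybrid vision encoder for high-resolution (1024 x 1024) images, training data that covers real-world scenarios such as web screenshots, PDFs, OCR, and charts, and an instruction tuning dataset built from a taxonomy of real user use cases. Together these are intended to improve practical usability and performance.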

Findings

01. Achieves state-of-the-art performance on visual-language benchmarks.

02. Maintains strong language abilities during pretraining.

03. Provides publicly accessible models for community use.

Abstract

We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions: We strive to ensure our data is diverse, scalable, and extensively covers real-world scenarios including web screenshots, PDFs, OCR, charts, and knowledge-based content, aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction tuning dataset accordingly. The fine-tuning with this dataset substantially improves the model's user experience in practical applications. Considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024), while maintaining a relatively low computational…

Code & Models

Repositories

deepseek-ai/deepseek-vl (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques