DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset

Hengyu Shen; Tiancheng Gu; Bin Qin; Lan Wu; Yuling Wu; Shuo Tan; Zelong Sun; Jun Wang; Nan Wu; Xiang An; Weidong Cai; Ziyong Feng; Kaicheng Yang

arXiv:2601.10305·cs.CV·March 26, 2026

Open Access · 2 Datasets

TL;DR

DanQing is a large-scale Chinese vision-language dataset of 100 million high-quality image-text pairs, curated from Common Crawl through a systematic pipeline. It improves model performance on various downstream tasks and captures semantic trends from 2024-2025.

Contribution

The paper introduces DanQing, a new large-scale Chinese cross-modal dataset with advanced data curation techniques, addressing the lack of high-quality open-source data for Chinese VLP.

Findings

01 DanQing outperforms existing Chinese datasets in downstream tasks.

02 The dataset captures recent semantic trends from 2024-2025.

03 It exhibits a balanced semantic distribution and better scaling capability.

Abstract

Vision-Language Pre-training (VLP) models have achieved remarkable success by leveraging large-scale image-text pairs. While English-centric models like CLIP and SigLIP benefit from massive datasets (e.g., LAION-400M), the development of Chinese VLP remains bottlenecked by the lack of high-quality, large-scale open-source data. In this paper, we present DanQing, a large-scale Chinese cross-modal dataset containing 100 million high-quality image-text pairs curated from Common Crawl. To ensure data quality, we develop a systematic pipeline comprising data source selection, text refinement, visual diversification, and cross-modal cross-batch filtering, which mitigates the intrinsic noise prevalent in web data. Notably, DanQing incorporates data from 2024-2025, enabling models to capture contemporary semantic trends and emerging concepts. Extensive…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques