Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution

Junyi Yuan; Jian Zhang; Fangyu Wu; Dongming Lu; Huanda Lu; Qiufeng Wang

arXiv:2505.10921 · cs.CV · July 22, 2025

TL;DR

This paper introduces CulTi, a specialized dataset for Chinese cultural heritage, and proposes LACLIP, a local alignment method that improves cross-modal retrieval of intricate visual and textual Chinese heritage data.

Contribution

The paper provides the first dedicated Chinese cultural heritage dataset CulTi and develops LACLIP, a novel training-free local alignment strategy for enhanced cross-modal retrieval.
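
The summary above does not spell out how LACLIP scores an image-text pair, so the following is only an illustrative sketch of what a training-free local alignment re-scoring step could look like. It assumes embeddings have already been produced by some frozen encoder. The function names, the `alpha` weight, the max over regions, and the toy vectors are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fused_score(text_vec, global_vec, region_vecs, alpha=0.5):
    """Blend whole-image similarity with the best-matching local region.

    alpha is a hypothetical mixing weight: alpha=1.0 ignores local regions.
    No parameters are learned here; everything happens at inference time.
    """
    global_sim = cosine(text_vec, global_vec)
    if not region_vecs:
        return global_sim
    local_sim = max(cosine(text_vec, r) for r in region_vecs)
    return alpha * global_sim + (1.0 - alpha) * local_sim


def rank_images(text_vec, gallery, alpha=0.5):
    """Return image names sorted from best to worst match for the text query."""
    scored = [
        (fused_score(text_vec, item["global"], item["regions"], alpha), name)
        for name, item in gallery.items()
    ]
    return [name for _, name in sorted(scored, key=lambda p: p[0], reverse=True)]


# Toy 2-D embeddings (made up): the query matches one local region of "mural"
# exactly, even though the whole-image embedding of "silk" is closer.
query = [1.0, 0.0]
gallery = {
    "mural": {"global": [0.6, 0.8], "regions": [[1.0, 0.0], [0.0, 1.0]]},
    "silk": {"global": [0.8, 0.6], "regions": [[0.0, 1.0]]},
}
global_only = rank_images(query, gallery, alpha=1.0)
with_local = rank_images(query, gallery, alpha=0.5)
```

In this toy setup, ranking by whole-image similarity alone puts "silk" first, while blending in the best local region moves "mural" to the top. That is the kind of fine-grained effect a local alignment strategy is meant to capture.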

Findings

01. LACLIP outperforms existing models in cross-modal retrieval accuracy.

02. The CulTi dataset presents unique challenges due to intricate visual-textual alignment.

03. LACLIP effectively handles fine-grained semantic associations in Chinese heritage data.

Abstract

China has a long and rich history, encompassing a vast cultural heritage that includes diverse multimodal information, such as silk patterns, Dunhuang murals, and their associated historical narratives. Cross-modal retrieval plays a pivotal role in understanding and interpreting Chinese cultural heritage by bridging visual and textual modalities to enable accurate text-to-image and image-to-text retrieval. However, despite the growing interest in multimodal research, there is a lack of specialized datasets dedicated to Chinese cultural heritage, limiting the development and evaluation of cross-modal learning models in this domain. To address this gap, we propose a multimodal dataset named CulTi, which contains 5,726 image-text pairs extracted from two series of professional documents, respectively related to ancient Chinese silk and Dunhuang murals. Compared to existing general-domain…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques