LLM-Driven Completeness and Consistency Evaluation for Cultural Heritage Data Augmentation in Cross-Modal Retrieval
Jian Zhang, Junyi Guo, Junyi Yuan, Huanda Lu, Yanlin Zhou, Fangyu Wu, Qiufeng Wang, Dongming Lu

TL;DR
This paper presents $C^3$, a framework that improves cross-modal retrieval of cultural heritage data by enhancing LLM-generated descriptions' completeness and consistency through semantic evaluation and adaptive reasoning supervision.
Contribution
It introduces a novel completeness and consistency evaluation framework for LLM-generated descriptions, improving cross-modal retrieval in cultural heritage datasets.
Findings
$C^3$ achieves state-of-the-art results on cultural heritage datasets.
The framework improves semantic coverage and factual consistency of descriptions.
It enhances retrieval performance in both fine-tuned and zero-shot settings.
Abstract
Cross-modal retrieval is essential for interpreting cultural heritage data, but its effectiveness is often limited by incomplete or inconsistent textual descriptions, caused by historical data loss and the high cost of expert annotation. While large language models (LLMs) offer a promising solution by enriching textual descriptions, their outputs frequently suffer from hallucinations or miss visually grounded details. To address these challenges, we propose $C^3$, a data augmentation framework that enhances cross-modal retrieval performance by improving the completeness and consistency of LLM-generated descriptions. $C^3$ introduces a completeness evaluation module to assess semantic coverage using both visual cues and language-model outputs. Furthermore, to mitigate factual inconsistencies, we formulate a Markov Decision Process to supervise Chain-of-Thought reasoning, guiding…
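As a rough illustration of what "semantic coverage" could mean here, the sketch below scores a description by the fraction of visual-cue terms it mentions. This is a toy proxy written under stated assumptions, not the paper's completeness evaluation module: the function names `completeness_score` and `missing_cues`, the plain substring matching, and the example cue list are all hypothetical.

```python
def completeness_score(description: str, visual_cues: list[str]) -> float:
    """Toy proxy: fraction of visual-cue terms that appear in the description.

    Assumption: cues are short keywords and case-insensitive substring
    matching is good enough for illustration. A real system would more
    likely compare embeddings or use a vision-language model.
    """
    if not visual_cues:
        return 1.0  # nothing to cover, so treat the description as complete
    text = description.lower()
    covered = sum(1 for cue in visual_cues if cue.lower() in text)
    return covered / len(visual_cues)


def missing_cues(description: str, visual_cues: list[str]) -> list[str]:
    """Return the cues that the description fails to mention, in input order."""
    text = description.lower()
    return [cue for cue in visual_cues if cue.lower() not in text]


# Hypothetical example: cues detected in an image of a bronze vessel.
cues = ["bronze", "tripod", "inscription", "handles"]
desc = "A bronze tripod vessel with two upright handles."

print(completeness_score(desc, cues))  # 3 of 4 cues are covered
print(missing_cues(desc, cues))        # cues a revised description should add
```

In a pipeline of this shape, the list of missing cues could be fed back to the LLM as a prompt for a revised description, which is one simple way a completeness signal might drive augmentation.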
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques
