Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation
Li Zhou, Lutong Yu, Dongchu Xie, Shaohuan Cheng, Wenyan Li, Haizhou Li

TL;DR
Hanfu-Bench is a new multimodal dataset that advances cross-temporal cultural understanding and transcreation of traditional Chinese attire, highlighting challenges for vision-language models in capturing cultural and temporal nuances.
Contribution
Introduces Hanfu-Bench, a multimodal dataset with tasks on cultural visual understanding and transcreation, emphasizing temporal aspects of Chinese culture.
Findings
Closed VLMs match non-experts in cultural visual understanding but lag behind human experts by 10%.
Open VLMs perform worse than non-experts.
Best transcreation model achieves only 42% success rate.
Abstract
Culture is a rich and dynamic domain that evolves across both geography and time. However, existing studies on cultural understanding with vision-language models (VLMs) primarily emphasize geographic diversity, often overlooking the critical temporal dimensions. To bridge this gap, we introduce Hanfu-Bench, a novel, expert-curated multimodal dataset. Hanfu, a traditional garment spanning ancient Chinese dynasties, serves as a representative cultural heritage that reflects the profound temporal aspects of Chinese culture while remaining highly popular in Chinese contemporary society. Hanfu-Bench comprises two core tasks: cultural visual understanding and cultural image transcreation. The former task examines temporal-cultural feature recognition based on single- or multi-image inputs through multiple-choice visual question answering, while the latter focuses on transforming traditional…
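As an illustration of how a multiple-choice visual question answering task like the one described above could be scored, here is a minimal sketch. The `MCQItem` fields and the `accuracy` helper are hypothetical names chosen for this example; they are not the benchmark's actual data schema or evaluation code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MCQItem:
    """One hypothetical multiple-choice VQA item (field names are illustrative)."""
    image_paths: List[str]  # one path for single-image input, several for multi-image
    question: str
    options: List[str]      # answer choices shown to the model
    answer: str             # gold option letter, e.g. "B"


def accuracy(items: List[MCQItem], predictions: List[str]) -> float:
    """Return the fraction of items whose predicted option letter matches the gold letter."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(
        1
        for item, pred in zip(items, predictions)
        # Compare letters case-insensitively and ignore surrounding whitespace.
        if pred.strip().upper() == item.answer.strip().upper()
    )
    return correct / len(items)
```

Under this setup, comparing a model with a human group amounts to computing `accuracy` for each on the same items and taking the difference.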
Taxonomy
Topics: Language, Metaphor, and Cognition
