AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking

Xilin Jiang; Qiaolin Wang; Junkai Wu; Xiaomin He; Zhongweiyang Xu; Yinghao Ma; Minshuo Piao; Kaiyi Yang; Xiuwen Zheng; Riki Shimizu; Yicong Chen; Arsalan Firoozi; Gavin Mischler; Sukru Samet Dindar; Richard Antonello; Linyang He; Tsun-An Hsieh; Xulin Fan; Yulun Wu; Yuesheng Ma; Chaitanya Amballa; Weixiong Chen; Jiarui Hai; Ruisi Li; Vishal Choudhari; Cong Han; Yinghao Aaron Li; Adeen Flinker; Mounya Elhilali; Emmanouil Benetos; Mark Hasegawa-Johnson; Romit Roy Choudhury; Nima Mesgarani

arXiv:2601.17645 · cs.SD · January 27, 2026

TL;DR

The AVMeme Exam benchmark assesses multimodal large language models' ability to understand internet audio-visual memes in cultural and contextual terms, revealing current models' limitations in perceiving deeper meaning beyond surface content.

Contribution

This paper introduces AVMeme Exam, a novel benchmark for evaluating multimodal models' understanding of cultural, contextual, and emotional aspects of internet memes across audio and visual modalities.

Findings

01. Models perform poorly on textless music and sound effects.

02. Models struggle with contextual and cultural understanding.

03. Current models show a significant gap in human-aligned multimodal intelligence.

Abstract

Internet audio-visual clips convey meaning through time-varying sound and motion, which extend beyond what text alone can represent. To examine whether AI models can understand such signals in human cultural contexts, we introduce AVMeme Exam, a human-curated benchmark of over one thousand iconic Internet sounds and videos spanning speech, songs, music, and sound effects. Each meme is paired with a unique Q&A assessing levels of understanding from surface content to context and emotion to usage and world knowledge, along with metadata such as original year, transcript, summary, and sensitivity. We systematically evaluate state-of-the-art multimodal large language models (MLLMs) alongside human participants using this benchmark. Our results reveal a consistent limitation: current models perform poorly on textless music and sound effects, and struggle to think in context and in culture…
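To make the item structure described in the abstract concrete, here is a minimal Python sketch of how one benchmark entry and a per-sound-type accuracy breakdown might be represented. Every name here (`MemeItem`, its fields, `accuracy_by_type`) is hypothetical and is not the schema of the released dataset. The multiple-choice format is also an assumption, since the abstract only says each meme is paired with a unique Q&A.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MemeItem:
    """One benchmark entry. All field names are illustrative assumptions."""
    sound_type: str            # e.g. "speech", "song", "music", "sound effect"
    question: str
    choices: List[str]         # assumed multiple-choice format
    answer_index: int          # index of the correct choice
    year: Optional[int] = None         # metadata: original year
    transcript: Optional[str] = None   # metadata: transcript, if any
    summary: Optional[str] = None      # metadata: short summary
    sensitive: bool = False            # metadata: sensitivity flag


def accuracy_by_type(items: List[MemeItem], predictions: List[int]) -> Dict[str, float]:
    """Return the fraction of correct predictions for each sound type."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item, predicted in zip(items, predictions):
        total[item.sound_type] += 1
        if predicted == item.answer_index:
            correct[item.sound_type] += 1
    return {sound_type: correct[sound_type] / total[sound_type] for sound_type in total}


if __name__ == "__main__":
    # Toy placeholder items, not taken from the benchmark.
    toy_items = [
        MemeItem("speech", "placeholder question", ["a", "b"], 0),
        MemeItem("music", "placeholder question", ["a", "b"], 1),
        MemeItem("music", "placeholder question", ["a", "b"], 0),
    ]
    toy_predictions = [0, 0, 0]
    print(accuracy_by_type(toy_items, toy_predictions))
```

Grouping scores by sound type is one way to surface the kind of gap the abstract reports between speech-bearing clips and textless music or sound effects; the same grouping could be keyed on any other metadata field.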

Code & Models

Datasets

naplab/AVMeme-Exam
dataset · 27 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Multisensory perception and integration