ViMU: Benchmarking Video Metaphorical Understanding

Qi Li; Xinchao Wang

arXiv:2605.14607 · cs.CV · May 15, 2026

TL;DR

ViMU is a new benchmark for evaluating models' ability to understand metaphorical, ironic, and social subtext in videos beyond literal content recognition.

Contribution

It introduces the first systematic benchmark to assess models' capacity for implicit video understanding grounded in multimodal evidence.

Findings

01

Models struggle with hint-free subtext inference.

02

ViMU enables evaluation of models on implicit, social, and cultural video meanings.

03

The benchmark promotes the development of more nuanced video understanding models.

Abstract

Any new medium, once it emerges, is used for more than the transmission of overt content alone. The information it carries typically operates on two levels: one is the content directly presented, while the other is the subtext beneath it, namely the implicit ideas and intentions the creator seeks to convey through the medium. Likewise, since video technologies became widely adopted, video has served not only as a powerful tool for recording and communicating visual information, but also as a vehicle for emotions, attitudes, and social meanings that are often difficult to articulate explicitly. Thus, the true meaning of many videos does not reside solely in what is shown on screen; it is often embedded in context, style of expression, and the viewer's social experience. Some forms of such video subtext are humorous, while others carry irony, mockery, or criticism. These implicit meanings can…

Code & Models

Repositories

liqiiiii/Video-Metaphorical-Understanding
github
