ViMU: Benchmarking Video Metaphorical Understanding
Qi Li, Xinchao Wang

TL;DR
ViMU is a new benchmark for evaluating models' ability to understand metaphorical, ironic, and social subtext in videos beyond literal content recognition.
Contribution
It introduces the first systematic benchmark to assess models' capacity for implicit video understanding grounded in multimodal evidence.
Findings
Models struggle with hint-free subtext inference.
ViMU enables evaluation of models on implicit, social, and cultural video meanings.
The benchmark promotes the development of more nuanced video understanding models.
Abstract
Any new medium, once it emerges, is used for more than the transmission of overt content alone. The information it carries typically operates on two levels: one is the content directly presented, while the other is the subtext beneath it: the implicit ideas and intentions the creator seeks to convey through the medium. Likewise, since video technologies became widely adopted, video has served not only as a powerful tool for recording and communicating visual information, but also as a vehicle for emotions, attitudes, and social meanings that are often difficult to articulate explicitly. Thus, the true meaning of many videos does not reside solely in what is shown on screen; it is often embedded in context, style of expression, and the viewer's social experience. Some forms of such video subtext are humorous, while others carry irony, mockery, or criticism. These implicit meanings can…
