Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies

Hung-Ting Su; Chun-Tong Chao; Ya-Ching Hsu; Xudong Lin; Yulei Niu; Hung-Yi Lee; Winston H. Hsu

arXiv:2406.10923·cs.CV·June 18, 2024

TL;DR

This paper introduces the Tropes in Movies dataset to evaluate large language models' video reasoning skills, revealing current methods' limitations and proposing enhancements that improve reasoning performance but still fall short of human levels.

Contribution

The paper presents a new dataset and evaluation protocol for video reasoning in LLMs, along with novel methods FEVoRI and ConQueR that improve reasoning accuracy.

Findings

01 Current models only marginally outperform random baselines.

02 Proposed methods improve F1 scores by 15 points.

03 Performance remains below human levels (40 vs. 65 F1 for models and humans, respectively).

Abstract

Large Language Models (LLMs) have demonstrated effectiveness not only in language tasks but also in video reasoning. This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills: (1) Abstract Perception: understanding and tokenizing abstract concepts in videos, and (2) Long-range Compositional Reasoning: planning and integrating intermediate reasoning steps for understanding long-range videos with numerous frames. Utilizing tropes from movie storytelling, TiM evaluates the reasoning capabilities of state-of-the-art LLM-based approaches. Our experiments show that current methods, including Captioner-Reasoner, Large Multimodal Model Instruction Fine-tuning, and Visual Programming, only marginally outperform a random baseline when tackling the challenges of Abstract Perception and Long-range…

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)