IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model
Yatai Ji, Shilong Zhang, Jie Wu, Peize Sun, Weifeng Chen, Xuefeng Xiao, Sidi Yang, Yujiu Yang, Ping Luo

TL;DR
This paper introduces IDA-VLM, a large vision-language model that recognizes and associates character identities across multiple scenes to better understand complex visual narratives such as movies. It also presents a new benchmark, MM-ID.
Contribution
The paper proposes an ID-aware vision-language model and a benchmark for instance ID recognition, addressing a limitation of current models for movie understanding: they focus on a single scene and do not associate instances across scenes.
Findings
Existing LVLMs struggle with instance ID recognition across scenes.
IDA-VLM improves character identity association in visual narratives.
The MM-ID benchmark evaluates multi-identity recognition capabilities.
Abstract
The rapid advancement of Large Vision-Language models (LVLMs) has demonstrated a spectrum of emergent capabilities. Nevertheless, current models only focus on the visual content of a single scenario, while their ability to associate instances across different scenes has not yet been explored, which is essential for understanding complex visual content, such as movies with multiple characters and intricate plots. Towards movie understanding, a critical initial step for LVLMs is to unleash the potential of character identities memory and recognition across multiple visual scenarios. To achieve the goal, we propose visual instruction tuning with ID reference and develop an ID-Aware Large Vision-Language Model, IDA-VLM. Furthermore, our research introduces a novel benchmark MM-ID, to examine LVLMs on instance IDs memory and recognition across four dimensions: matching, location,…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Video Analysis and Summarization
