IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model

Yatai Ji; Shilong Zhang; Jie Wu; Peize Sun; Weifeng Chen; Xuefeng Xiao; Sidi Yang; Yujiu Yang; Ping Luo

arXiv:2407.07577 · cs.CV · July 11, 2024

TL;DR

This paper introduces IDA-VLM, a large vision-language model designed to improve understanding of complex visual narratives such as movies by recognizing and associating character identities across multiple scenes. It also presents a new benchmark, MM-ID.

Contribution

The paper proposes a novel ID-aware vision-language model and a benchmark for instance ID recognition, addressing a key limitation in current models for movie understanding.

Findings

01. Existing LVLMs struggle with instance ID recognition across scenes.

02. IDA-VLM improves character identity association in visual narratives.

03. The MM-ID benchmark evaluates multi-identity recognition capabilities.

Abstract

The rapid advancement of Large Vision-Language Models (LVLMs) has demonstrated a spectrum of emergent capabilities. Nevertheless, current models focus only on the visual content of a single scenario, while their ability to associate instances across different scenes has not yet been explored. This ability is essential for understanding complex visual content, such as movies with multiple characters and intricate plots. Towards movie understanding, a critical initial step for LVLMs is to unleash the potential of memorizing and recognizing character identities across multiple visual scenarios. To achieve this goal, we propose visual instruction tuning with ID reference and develop an ID-Aware Large Vision-Language Model, IDA-VLM. Furthermore, our research introduces a novel benchmark, MM-ID, to examine LVLMs on instance ID memory and recognition across four dimensions: matching, location,…

Code & Models

Repositories

jiyt17/ida-vlm · PyTorch · Official

Videos

IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model · SlidesLive

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Video Analysis and Summarization

Methods: Focus