IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes

Yujia Liang; Jile Jiao; Xuetao Feng; Zixuan Ye; Yuan Wang; Zhicheng Wang

arXiv:2506.21116·cs.CV·July 9, 2025

TL;DR

This paper introduces MultiClip-Bench, a dataset for multi-shot video understanding, and IPFormer-VideoLLM, a model that uses instance-level prompts, to address failures caused by scene changes and varying camera angles.

Contribution

The work presents a novel dataset for multi-shot scenarios and a new model that effectively encodes instance-specific information across scenes.

Findings

01

Enhanced multi-shot performance on new dataset

02

Improved accuracy across various video benchmarks

03

Effective encoding of instance-level features

Abstract

Video Large Language Models (VideoLLMs) have demonstrated remarkable understanding capabilities but struggle with multi-shot scenarios, e.g., video clips with varying camera angles or scene changes. This challenge can lead to failures such as instance identity forgetting and key frame negligence. In this work, we first attribute the challenge to the lack of multi-shot annotations in existing datasets and therefore introduce a new dataset, MultiClip-Bench, featuring dense descriptions and instruction-based question-answering pairs tailored for multi-shot scenarios. We empirically find that the training set significantly boosts multi-shot performance, while the testing benchmark provides a reliable measure of model capability in multi-shot scenarios. By further analyzing and discovering that current models only encode instance features in a discrete or…

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging · Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training