CoS: Chain-of-Shot Prompting for Long Video Understanding

Jian Hu; Zixu Cheng; Chenyang Si; Wei Li; Shaogang Gong

arXiv:2502.06428·cs.CV·February 12, 2025

TL;DR

This paper introduces Chain-of-Shot prompting (CoS), a method that adaptively selects relevant shots in long videos to improve multi-modal large language model understanding, addressing the challenge of processing lengthy visual content.

Contribution

The paper proposes a novel shot selection framework using test-time prompt optimization, including pseudo temporal grounding and binary coding for task-relevant shot identification.

Findings

01. CoS improves long video understanding across multiple datasets.

02. The method effectively balances relevant and irrelevant shot selection.

03. Experiments show enhanced performance over baseline models.

Abstract

Multi-modal Large Language Models (MLLMs) struggle with long videos because they require an excessive number of visual tokens. These tokens massively exceed the context length of MLLMs, so the context ends up filled with redundant, task-irrelevant shots. How to select shots remains a critical unsolved problem: sparse sampling risks missing key details, while exhaustive sampling overwhelms the model with irrelevant content, leading to video misunderstanding. To solve this problem, we propose Chain-of-Shot prompting (CoS). The key idea is to frame shot selection as test-time visual prompt optimisation, choosing shots adaptively for the semantic task of video understanding by optimising shot-task alignment. CoS has two key parts: (1) a binary video summary mechanism that performs pseudo temporal grounding, discovering a binary coding to identify task-relevant shots, and (2) a video co-reasoning module that deploys the binary…
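
The idea of a binary coding over shots can be illustrated with a small sketch. This is not the authors' implementation: the per-shot relevance scores, the 0.5 threshold, the frame budget, and the helper names `binary_code`, `split_shots`, and `fit_to_budget` are all assumptions made for illustration. The abstract does not say how CoS derives its coding, so the scores here are simply given as input.

```python
from typing import List, Sequence, Tuple


def binary_code(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Turn hypothetical per-shot relevance scores into a 0/1 code."""
    return [1 if s >= threshold else 0 for s in scores]


def split_shots(code: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Return indices of task-relevant (1) and task-irrelevant (0) shots."""
    positive = [i for i, bit in enumerate(code) if bit == 1]
    negative = [i for i, bit in enumerate(code) if bit == 0]
    return positive, negative


def fit_to_budget(indices: Sequence[int], budget: int) -> List[int]:
    """Keep at most `budget` shots, evenly spaced, in temporal order."""
    if budget <= 0:
        return []
    if len(indices) <= budget:
        return list(indices)
    step = len(indices) / budget
    return [indices[int(k * step)] for k in range(budget)]


# Assumed relevance scores for eight shots (illustrative values only).
scores = [0.1, 0.8, 0.7, 0.2, 0.9, 0.05, 0.6, 0.3]
code = binary_code(scores)
positive, negative = split_shots(code)
selected = fit_to_budget(positive, budget=3)
```

In this toy setting the code marks shots 1, 2, 4, and 6 as relevant, and the budget step trims that set so the selected shots fit a limited context. The negative indices are kept separately, since the abstract mentions a co-reasoning module that makes use of the binary coding.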

Taxonomy

Topics: Advanced Image Processing Techniques · Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis