Shot-Aware Frame Sampling for Video Understanding

Mengyu Zhao; Di Fu; Yongyu Xie; Jiaxing Zhang; Zhigang Yuan; Shirin Jalali; Yong Cao

arXiv:2603.17374 · cs.CV · March 19, 2026

TL;DR

This paper introduces InfoShot, a shot-aware frame sampling method that enhances long-video understanding by selecting keyframes to preserve both overall context and critical short events, improving performance without retraining.

Contribution

The paper proposes a task-agnostic, shot-aware frame sampler that partitions a video into shots and selects representative frames from each shot under an information-theoretic objective, so that brief but important events are more likely to survive sampling.

Findings

01. Improves anomaly detection hit rate and Video-QA accuracy under frame constraints.

02. Outperforms strong baselines on standard video understanding benchmarks.

03. Introduces SynFlash, a synthetic benchmark for short-lived event detection.

Abstract

Video frame sampling is essential for efficient long-video understanding with Vision-Language Models (VLMs), since dense inputs are costly and often exceed context limits. Yet when only a small number of frames can be retained, existing samplers often fail to balance broad video coverage with brief but critical events, which can lead to unreliable downstream predictions. To address this issue, we present InfoShot, a task-agnostic, shot-aware frame sampler for long-video understanding. InfoShot first partitions a video into semantically consistent shots, and then selects two complementary keyframes from each shot: one to represent the main content and one to capture unusual within-shot changes. This design is guided by an information-theoretic objective that encourages the sampled set to retain high information about both shot structure and sparse within-shot deviations. In this way, it…
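The two-stage idea described in the abstract (partition into shots, then keep one representative and one deviating frame per shot) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-distance shot cut, the `cut_threshold` parameter, the "closest to / farthest from the shot mean" selection rule, and all function names are assumptions chosen for illustration, standing in for the paper's information-theoretic objective. Frame features are assumed to be precomputed vectors (for example, from any image encoder).

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)


def split_into_shots(features, cut_threshold=0.5):
    """Cut wherever consecutive frames differ by more than cut_threshold.

    Returns a list of (start, end) index ranges, end exclusive.
    The threshold rule is a hypothetical stand-in for real shot detection.
    """
    if not features:
        return []
    shots, start = [], 0
    for i in range(1, len(features)):
        if cosine_distance(features[i - 1], features[i]) > cut_threshold:
            shots.append((start, i))
            start = i
    shots.append((start, len(features)))
    return shots


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sample_two_per_shot(features, cut_threshold=0.5):
    """Pick up to two frame indices per shot.

    Representative: the frame closest to the shot's mean feature.
    Deviation: the frame farthest from it (assumed proxy for an
    unusual within-shot change). Shots with no variation yield one frame.
    """
    selected = []
    for start, end in split_into_shots(features, cut_threshold):
        shot = features[start:end]
        center = mean_vector(shot)
        dists = [cosine_distance(f, center) for f in shot]
        rep = start + min(range(len(shot)), key=dists.__getitem__)
        dev = start + max(range(len(shot)), key=dists.__getitem__)
        selected.append(rep)
        if dev != rep:
            selected.append(dev)
    return sorted(selected)


if __name__ == "__main__":
    # Toy features: a four-frame shot with one odd frame, then a static shot.
    frames = [[1, 0], [1, 0], [0.8, 0.6], [1, 0], [0, 1], [0, 1]]
    print(sample_two_per_shot(frames))
```

On the toy input, the first shot contributes its typical frame and the single odd frame, while the static second shot contributes only one frame. A uniform sampler with the same budget has no mechanism that favors the odd frame, which is the failure mode the abstract describes.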

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning