Progress-Aware Video Frame Captioning

Zihui Xue; Joungbin An; Xitong Yang; Kristen Grauman

arXiv:2412.02071·cs.CV·March 27, 2025

Open Access

TL;DR

This paper introduces progress-aware video frame captioning, a new task of generating detailed, temporally precise descriptions for each video frame that capture how an action progresses across the sequence.

Contribution

We propose ProgressCaptioner, a model for frame-level captioning that captures fine-grained temporal dynamics within an action sequence, along with the FrameCap dataset to support training and the FrameCapEval benchmark to assess caption quality.

Findings

01. ProgressCaptioner outperforms existing models in capturing action progression.

02. The FrameCap dataset enables effective training of fine-grained captioning models.

03. Our approach improves keyframe selection and overall video understanding.

Abstract

While image captioning provides isolated descriptions for individual images, and video captioning offers one single narrative for an entire video clip, our work explores an important middle ground: progress-aware video captioning at the frame level. This novel task aims to generate temporally fine-grained captions that not only accurately describe each frame but also capture the subtle progression of actions throughout a video sequence. Despite the strong capabilities of existing leading vision language models, they often struggle to discern the nuances of frame-wise differences. To address this, we propose ProgressCaptioner, a captioning model designed to capture the fine-grained temporal dynamics within an action sequence. Alongside, we develop the FrameCap dataset to support training and the FrameCapEval benchmark to assess caption quality. The results demonstrate that…

Taxonomy

Topics: Video Analysis and Summarization · Advanced Vision and Imaging · Human Pose and Action Recognition

Methods: Sparse Evolutionary Training