Controllable Hybrid Captioner for Improved Long-form Video Understanding

Kuleen Sasse; Efsun Sarioglu Kayi; Arun Reddy

arXiv:2507.17047 · cs.CV · November 11, 2025

TL;DR

This paper introduces a controllable hybrid video captioning system that combines action and scene descriptions to enhance long-form video understanding and question answering capabilities.

Contribution

It presents a novel controllable captioning framework that integrates static scene descriptions with action captions, improving the quality and completeness of textual video memory.

Findings

01. Enhanced caption quality with combined action and scene descriptions

02. Improved question-answering accuracy on long videos

03. Efficient fine-tuning of captioner for multiple caption types

Abstract

Video data, especially long-form video, is extremely dense and high-dimensional. Text-based summaries of video content offer a way to represent query-relevant content far more compactly than raw video. In addition, textual representations are easily ingested by state-of-the-art large language models (LLMs), which enable reasoning over video content to answer complex natural language queries. To produce such summaries for long videos, we rely on the progressive construction of a text-based memory by a video captioner operating on shorter chunks of the video, where spatio-temporal modeling is computationally feasible. We explore ways to improve the quality of the activity log composed solely of short video captions. Because video captions tend to focus on human actions, while questions may pertain to other information in the scene, we seek to enrich the memory with static scene descriptions…
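The progressive, chunk-by-chunk construction of a text memory described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `Captioner` callable interface, the `MemoryEntry` record, the fixed chunk length, the `scene_every` schedule for interleaving scene descriptions, and the stub captioner that stands in for a real model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MemoryEntry:
    """One line of the text memory (hypothetical record layout)."""
    start_s: float
    end_s: float
    kind: str  # "action" or "scene"
    text: str


# Hypothetical captioner interface: (start_s, end_s, kind) -> caption text.
# A real system would run a video captioning model on the chunk's frames.
Captioner = Callable[[float, float, str], str]


def build_text_memory(
    duration_s: float,
    chunk_s: float,
    captioner: Captioner,
    scene_every: int = 1,
) -> List[MemoryEntry]:
    """Walk the video in short chunks and append captions to a growing log.

    Every chunk gets an action caption; every `scene_every`-th chunk also
    gets a static scene description (an assumed schedule).
    """
    if chunk_s <= 0 or scene_every < 1:
        raise ValueError("chunk_s must be positive and scene_every at least 1")

    memory: List[MemoryEntry] = []
    start = 0.0
    index = 0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        memory.append(
            MemoryEntry(start, end, "action", captioner(start, end, "action"))
        )
        if index % scene_every == 0:
            memory.append(
                MemoryEntry(start, end, "scene", captioner(start, end, "scene"))
            )
        start = end
        index += 1
    return memory


def render_memory(memory: List[MemoryEntry]) -> str:
    """Flatten the memory into plain text that could be placed in an LLM prompt."""
    return "\n".join(
        f"[{e.start_s:.0f}-{e.end_s:.0f}s] {e.kind}: {e.text}" for e in memory
    )


def stub_captioner(start_s: float, end_s: float, kind: str) -> str:
    """Placeholder that only echoes its inputs; it performs no video analysis."""
    return f"{kind} caption for {start_s:.0f}-{end_s:.0f}s"
```

Calling `build_text_memory(10.0, 4.0, stub_captioner, scene_every=2)` walks three chunks and yields five entries, and `render_memory` turns them into a timestamped log. In this sketch a single callable serves both caption types through its `kind` argument, which loosely mirrors the idea of one controllable captioner; how the paper actually schedules and prompts the two caption types is not specified in the text above.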

Taxonomy

Topics: Advanced Vision and Imaging · Video Surveillance and Tracking Methods · Advanced Image and Video Retrieval Techniques