Controllable Hybrid Captioner for Improved Long-form Video Understanding
Kuleen Sasse, Efsun Sarioglu Kayi, Arun Reddy

TL;DR
This paper introduces a controllable hybrid video captioning system that combines action and scene descriptions to enhance long-form video understanding and question answering capabilities.
Contribution
It presents a novel controllable captioning framework that integrates static scene descriptions with action captions, improving the quality and completeness of textual video memory.
Findings
Enhanced caption quality with combined action and scene descriptions
Improved question-answering accuracy on long videos
Efficient fine-tuning of captioner for multiple caption types
Abstract
Video data, especially long-form video, is extremely dense and high-dimensional. Text-based summaries of video content offer a way to represent query-relevant content in a much more compact manner than raw video. In addition, textual representations are easily ingested by state-of-the-art large language models (LLMs), which enable reasoning over video content to answer complex natural language queries. To build such a representation, we rely on the progressive construction of a text-based memory by a video captioner operating on shorter chunks of the video, where spatio-temporal modeling is computationally feasible. We explore ways to improve the quality of an activity log composed solely of short video captions. Because the video captions tend to focus on human actions, and questions may pertain to other information in the scene, we seek to enrich the memory with static scene descriptions…
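The chunk-wise memory construction described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `chunk_spans`, `build_memory`, `render_memory`, the `MemoryEntry` record, the `scene_every` parameter, and the convention of selecting the caption type with a string flag are assumptions made for illustration. The captioner is a stub callable standing in for a real video captioning model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class MemoryEntry:
    start_s: float
    end_s: float
    kind: str  # "action" or "scene" (hypothetical control flag)
    text: str


def chunk_spans(duration_s: float, chunk_s: float) -> List[Tuple[float, float]]:
    """Split a video of duration_s seconds into consecutive chunks of at most chunk_s."""
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        start = end
    return spans


def build_memory(
    duration_s: float,
    chunk_s: float,
    captioner: Callable[[float, float, str], str],
    scene_every: int = 1,
) -> List[MemoryEntry]:
    """Build a text memory: one action caption per chunk, plus a scene
    description on every scene_every-th chunk (an assumed schedule)."""
    memory = []
    for i, (start, end) in enumerate(chunk_spans(duration_s, chunk_s)):
        memory.append(MemoryEntry(start, end, "action", captioner(start, end, "action")))
        if i % scene_every == 0:
            memory.append(MemoryEntry(start, end, "scene", captioner(start, end, "scene")))
    return memory


def render_memory(memory: List[MemoryEntry]) -> str:
    """Serialize the memory as plain text that could be placed in an LLM prompt."""
    return "\n".join(
        f"[{m.start_s:.0f}-{m.end_s:.0f}s] {m.kind}: {m.text}" for m in memory
    )


def stub_captioner(start_s: float, end_s: float, kind: str) -> str:
    # Placeholder for a real captioning model; returns a dummy string.
    return f"<{kind} caption for {start_s:.0f}-{end_s:.0f}s>"


if __name__ == "__main__":
    log = build_memory(10.0, 4.0, stub_captioner, scene_every=2)
    print(render_memory(log))
```

In this sketch the rendered log would be concatenated with a question and passed to an LLM; that step is omitted because the page does not specify how the question-answering model is prompted.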
Taxonomy
Topics: Advanced Vision and Imaging · Video Surveillance and Tracking Methods · Advanced Image and Video Retrieval Techniques
