IntentVCNet: Bridging Spatio-Temporal Gaps for Intention-Oriented Controllable Video Captioning

Tianheng Qiu; Jingchun Gao; Jingyu Li; Huiyi Leong; Xuan Huang; Xi Wang; Xiaocheng Zhang; Kele Xu; Lan Zhang

arXiv:2507.18531 · cs.CV · July 25, 2025

TL;DR

IntentVCNet enhances large visual language models to achieve fine-grained, intent-oriented video captioning by bridging spatial and temporal understanding gaps through prompt strategies and visual context augmentation.

Contribution

The paper introduces IntentVCNet, a novel approach that unifies spatial and temporal understanding in LVLMs for controlled video captioning, addressing the spatio-temporal gap.

Findings

01

Achieved state-of-the-art results on open-source LVLMs.

02

Facilitated accurate generation of intent-oriented video captions.

03

Runner-up in the IntentVC challenge.

Abstract

Intent-oriented controlled video captioning aims to generate targeted descriptions for specific targets in a video based on customized user intent. Current Large Visual Language Models (LVLMs) have gained strong instruction-following and visual comprehension capabilities. Although LVLMs have demonstrated proficiency in spatial and temporal understanding separately, they are not able to perform fine-grained spatial control over time sequences in direct response to instructions. This substantial spatio-temporal gap complicates efforts to achieve fine-grained intention-oriented control in video. To this end, we propose a novel IntentVCNet that unifies the temporal and spatial understanding knowledge inherent in LVLMs to bridge the spatio-temporal gap from both prompting and model perspectives. Specifically, we first propose a prompt combination strategy designed to enable the LLM to model…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Vision and Imaging