Scene Graph-guided SegCaptioning Transformer with Fine-grained Alignment for Controllable Video Segmentation and Captioning

Xu Zhang; Jin Yuan; BinHong Yang; Xuan Liu; Qianjun Zhang; Yuyi Wang; Zhiyong Li; Hanwang Zhang

arXiv:2603.20887 · cs.CV · March 24, 2026

TL;DR

This paper introduces a framework for controllable video segmentation and captioning in which a user prompt, such as a bounding box around an object of interest, steers the joint generation of masks and captions that reflect the user's intent.

Contribution

The paper presents SG-FSCFormer, a new model integrating prompt-guided graph reasoning and fine-grained alignment for user-controlled video interpretation.

Findings

01. Achieves state-of-the-art performance on benchmark datasets.

02. Effectively captures and aligns user prompts with video content.

03. Generates high-quality, user-specific multimodal outputs.

Abstract

Recent advancements in multimodal large models have significantly bridged the representation gap between diverse modalities, catalyzing the evolution of video multimodal interpretation, which enhances users' understanding of video content by generating correlated modalities. However, most existing video multimodal interpretation methods primarily concentrate on global comprehension with limited user interaction. To address this, we propose a novel task, Controllable Video Segmentation and Captioning (SegCaptioning), which empowers users to provide specific prompts, such as a bounding box around an object of interest, to simultaneously generate correlated masks and captions that precisely embody user intent. We design an innovative framework, the Scene Graph-guided Fine-grained SegCaptioning Transformer (SG-FSCFormer), which integrates a Prompt-guided Temporal Graph Former to effectively…
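To make the task's inputs and outputs concrete, the sketch below models a SegCaptioning request as a bounding-box prompt on one frame and a response as one binary mask per frame plus a caption. Every name here (`BoxPrompt`, `SegCaptionResult`, `box_to_mask`, `placeholder_segcaption`) is hypothetical and chosen for illustration; none comes from the paper. The placeholder function is not SG-FSCFormer: it simply copies the prompt box to every frame, so only the shape of the interface is shown.

```python
from dataclasses import dataclass
from typing import List, Tuple

# All names in this block are hypothetical. They illustrate the task's
# inputs and outputs only and are not taken from the paper or its code.


@dataclass
class BoxPrompt:
    frame_index: int                 # frame on which the user drew the box
    box: Tuple[int, int, int, int]   # (x0, y0, x1, y1) in pixels, x1/y1 exclusive


@dataclass
class SegCaptionResult:
    masks: List[List[List[bool]]]    # one height x width binary mask per frame
    caption: str                     # text describing the prompted object


def box_to_mask(box: Tuple[int, int, int, int], height: int, width: int) -> List[List[bool]]:
    """Rasterize a box into a height x width binary mask."""
    x0, y0, x1, y1 = box
    return [
        [(x0 <= x < x1) and (y0 <= y < y1) for x in range(width)]
        for y in range(height)
    ]


def placeholder_segcaption(
    num_frames: int, height: int, width: int, prompt: BoxPrompt
) -> SegCaptionResult:
    """Stand-in for a real model: repeats the prompt box on every frame.

    A real system would track and segment the object over time and
    generate the caption; here the caption is a fixed placeholder string.
    """
    mask = box_to_mask(prompt.box, height, width)
    masks = [[row[:] for row in mask] for _ in range(num_frames)]
    return SegCaptionResult(masks=masks, caption="<caption for the prompted object>")
```

For example, `placeholder_segcaption(2, 3, 4, BoxPrompt(frame_index=0, box=(1, 1, 3, 2)))` returns two 3 x 4 masks together with the placeholder caption string.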

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition