Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation

Shaofei Huang; Rui Ling; Hongyu Li; Tianrui Hui; Zongheng Tang; Xiaoming Wei; Jizhong Han; Si Liu

arXiv:2408.15876·cs.CV·December 24, 2024

TL;DR

This paper introduces a training-free pipeline that leverages GPT-4 for temporal-spatial reasoning to improve audio and language-referenced video object segmentation, achieving competitive results without model fine-tuning.

Contribution

The novel GPT-assisted pivot selection and language unification modules enable effective training-free video object segmentation using large language models.

Findings

01

Achieves performance comparable to or better than that of supervised methods.

02

Introduces GPT-PS for high-quality object prompt generation.

03

Unifies AVS and RVOS tasks in a single pipeline.

Abstract

In this paper, we propose an Audio-Language-Referenced SAM 2 (AL-Ref-SAM 2) pipeline to explore the training-free paradigm for audio and language-referenced video object segmentation, namely AVS and RVOS tasks. The intuitive solution leverages GroundingDINO to identify the target object from a single frame and SAM 2 to segment the identified object throughout the video, which is less robust to spatiotemporal variations due to a lack of video context exploration. Thus, in our AL-Ref-SAM 2 pipeline, we propose a novel GPT-assisted Pivot Selection (GPT-PS) module to instruct GPT-4 to perform two-step temporal-spatial reasoning for sequentially selecting pivot frames and pivot boxes, thereby providing SAM 2 with a high-quality initial object prompt. Within GPT-PS, two task-specific Chain-of-Thought prompts are designed to unleash GPT's temporal-spatial reasoning capacity by guiding GPT to…
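The two-step reasoning described in the abstract (first a pivot frame, then a pivot box within it) can be sketched as plain control flow. This is a minimal illustration, not the authors' implementation: `select_pivot` and its three callables (`choose_frame`, `propose_boxes`, `choose_box`) are hypothetical names standing in for the GPT-4 queries and for whatever detector supplies candidate boxes, and the assumption that candidate boxes come from a detector is ours.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed box format


def select_pivot(
    frames: Sequence[Frame],
    reference: str,
    choose_frame: Callable[[Sequence[Frame], str], int],
    propose_boxes: Callable[[Frame, str], Sequence[Box]],
    choose_box: Callable[[Frame, Sequence[Box], str], int],
) -> Tuple[int, Box]:
    """Pick a pivot frame, then a pivot box on it (all callables are hypothetical)."""
    if not frames:
        raise ValueError("need at least one frame")

    # Step 1 (temporal): ask the reasoning model which frame best shows the target.
    frame_idx = choose_frame(frames, reference)
    if not 0 <= frame_idx < len(frames):
        raise IndexError(f"frame index {frame_idx} out of range")
    pivot_frame = frames[frame_idx]

    # Step 2 (spatial): gather candidate boxes on that frame, then ask which one
    # matches the reference.
    boxes = propose_boxes(pivot_frame, reference)
    if not boxes:
        raise ValueError("no candidate boxes on the pivot frame")
    box_idx = choose_box(pivot_frame, boxes, reference)
    if not 0 <= box_idx < len(boxes):
        raise IndexError(f"box index {box_idx} out of range")

    # The (frame index, box) pair is what would be handed to the video segmenter
    # as its initial object prompt.
    return frame_idx, boxes[box_idx]
```

Passing the model and detector in as callables keeps the sketch runnable with stubs and avoids guessing at any real API; in an actual pipeline they would wrap the GPT-4 prompts and the box proposals, with the result given to SAM 2.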

Code & Models

Repositories

appletea233/al-ref-sam2
PyTorch · Official

Videos

Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Cosine Annealing · Linear Layer · Adam · Layer Normalization · Weight Decay · Attention Is All You Need · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection