DTLLM-VLT: Diverse Text Generation for Visual Language Tracking Based on LLM

Xuchen Li; Xiaokun Feng; Shiyu Hu; Meiqi Wu; Dailing Zhang; Jing Zhang; Kaiqi Huang

arXiv:2405.12139 · cs.CV · October 10, 2024

Open Access

TL;DR

This paper introduces DTLLM-VLT, a method that automatically generates multi-granularity textual descriptions to improve visual language tracking, enhancing semantic diversity and evaluation of multi-modal trackers.

Contribution

It presents a novel LLM-based framework for generating diverse, multi-granularity text descriptions to enhance VLT benchmarks and evaluation.

Findings

01. Improved tracking performance with multi-granularity text
02. Seamless integration into various benchmarks
03. Enhanced evaluation of multi-modal trackers

Abstract

Visual Language Tracking (VLT) enhances single object tracking (SOT) by integrating natural language descriptions from a video, for the precise tracking of a specified object. By leveraging high-level semantic information, VLT guides object tracking, alleviating the constraints associated with relying on a visual modality. Nevertheless, most VLT benchmarks are annotated in a single granularity and lack a coherent semantic framework to provide scientific guidance. Moreover, coordinating human annotators for high-quality annotations is laborious and time-consuming. To address these challenges, we introduce DTLLM-VLT, which automatically generates extensive and multi-granularity text to enhance environmental diversity. (1) DTLLM-VLT generates scientific and multi-granularity text descriptions using a cohesive prompt framework. Its succinct and highly adaptable design allows seamless…

Taxonomy

Topics: Video Analysis and Summarization · Text and Document Classification Technologies · Speech and dialogue systems