Temporal-Enhanced Multimodal Transformer for Referring Multi-Object Tracking and Segmentation

Changcheng Xiao; Qiong Cao; Yujie Zhong; Xiang Zhang; Tao Wang; Canqun Yang; Long Lan

arXiv:2410.13437 · cs.CV · October 18, 2024

Open Access

TL;DR

This paper introduces TenRMOT, a Transformer-based method for referring multi-object tracking and segmentation that leverages multi-stage feature fusion, language-guided queries, and temporal priors, along with a new dataset, Ref-KITTI Segmentation.

Contribution

The paper proposes a novel Transformer-based framework with multi-stage feature fusion and temporal priors for RMOT, and introduces a new challenging dataset for referring multi-object tracking and segmentation.

Findings

01 TenRMOT outperforms existing methods on RMOT and segmentation tasks.

02 The new dataset Ref-KITTI Segmentation contains 18 videos with 818 expressions.

03 Temporal priors improve trajectory consistency in tracking.

Abstract

Referring multi-object tracking (RMOT) is an emerging cross-modal task that aims to locate an arbitrary number of target objects referred to by a language expression in a video and to maintain their identities. This intricate task involves reasoning over linguistic and visual modalities, along with the temporal association of target objects. However, the seminal work employs only loose feature fusion and overlooks the utilization of long-term information on tracked objects. In this study, we introduce a compact Transformer-based method, termed TenRMOT. We conduct feature fusion at both the encoding and decoding stages to fully exploit the advantages of the Transformer architecture. Specifically, we incrementally perform cross-modal fusion layer-by-layer during the encoding phase. In the decoding phase, we utilize language-guided queries to probe memory features for accurate prediction of the…
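
The layer-by-layer cross-modal fusion mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that each encoder layer lets visual tokens attend to language tokens through plain scaled dot-product cross-attention followed by a residual add. The names `cross_attend`, `fuse_layer`, and `encode`, the toy token values, and the absence of learned projections, multiple heads, and normalization are all simplifying assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product cross-attention with no learned projections
    (an assumption; a real model would project queries, keys, and values)."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        # Weighted average of the key/value tokens for this query.
        out.append([sum(w * k[d] for w, k in zip(weights, keys_values))
                    for d in range(dim)])
    return out


def fuse_layer(visual, language):
    """One hypothetical fusion layer: visual tokens query the language
    tokens, and the result is added back as a residual."""
    attended = cross_attend(visual, language)
    return [[v + a for v, a in zip(v_tok, a_tok)]
            for v_tok, a_tok in zip(visual, attended)]


def encode(visual, language, num_layers=3):
    """Apply fusion incrementally, once per encoder layer."""
    for _ in range(num_layers):
        visual = fuse_layer(visual, language)
    return visual


# Toy inputs: three visual tokens and two language tokens, each of dimension 2.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
language_tokens = [[1.0, 1.0], [0.0, 2.0]]
fused = encode(visual_tokens, language_tokens)
```

Because fusion is repeated at every layer, language information accumulates in the visual tokens as depth increases, rather than being injected once at the end of the encoder.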

Taxonomy

Topics: Robotics and Automated Systems · Target Tracking and Data Fusion in Sensor Networks · Speech and dialogue systems

Methods: Dropout · Layer Normalization · Adam · Attention Is All You Need · Dense Connections · Residual Connection · Position-Wise Feed-Forward Layer · Linear Layer · Byte Pair Encoding · Absolute Position Encodings