Temporal-Enhanced Multimodal Transformer for Referring Multi-Object Tracking and Segmentation
Changcheng Xiao, Qiong Cao, Yujie Zhong, Xiang Zhang, Tao Wang, Canqun Yang, Long Lan

TL;DR
This paper introduces TenRMOT, a Transformer-based method for referring multi-object tracking and segmentation that leverages multi-stage feature fusion, language-guided queries, and temporal priors, along with a new dataset, Ref-KITTI Segmentation.
Contribution
The paper proposes a novel Transformer-based framework with multi-stage feature fusion and temporal priors for RMOT, and introduces a new challenging dataset for referring multi-object tracking and segmentation.
Findings
TenRMOT outperforms existing methods on RMOT and segmentation tasks.
The new dataset Ref-KITTI Segmentation contains 18 videos with 818 expressions.
Temporal priors improve trajectory consistency in tracking.
Abstract
Referring multi-object tracking (RMOT) is an emerging cross-modal task that aims to locate an arbitrary number of target objects and maintain their identities referred by a language expression in a video. This intricate task involves the reasoning of linguistic and visual modalities, along with the temporal association of target objects. However, the seminal work employs only loose feature fusion and overlooks the utilization of long-term information on tracked objects. In this study, we introduce a compact Transformer-based method, termed TenRMOT. We conduct feature fusion at both encoding and decoding stages to fully exploit the advantages of Transformer architecture. Specifically, we incrementally perform cross-modal fusion layer-by-layer during the encoding phase. In the decoding phase, we utilize language-guided queries to probe memory features for accurate prediction of the…
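The layer-by-layer cross-modal fusion mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes fusion is done by letting each visual token attend over the text tokens with scaled dot-product attention and a residual connection, and it omits learned projections, normalization, and feed-forward layers. The names `cross_attend` and `fuse_layer_by_layer` and the default layer count are hypothetical, chosen only for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(visual, text):
    """Let every visual token attend over all text tokens.

    visual: list of visual feature vectors (lists of floats)
    text:   list of text feature vectors with the same dimension
    Assumption: no learned query/key/value projections, to keep the sketch small.
    """
    dim = len(text[0])
    fused = []
    for v in visual:
        # Scaled dot-product scores between this visual token and each text token.
        scores = [sum(a * b for a, b in zip(v, t)) / math.sqrt(dim) for t in text]
        weights = softmax(scores)
        # Language context: attention-weighted sum of the text tokens.
        context = [sum(w * t[i] for w, t in zip(weights, text)) for i in range(dim)]
        # Residual connection: keep the visual feature and add the language context.
        fused.append([a + c for a, c in zip(v, context)])
    return fused


def fuse_layer_by_layer(visual, text, num_layers=3):
    """Apply cross-modal fusion once per (hypothetical) encoder layer."""
    for _ in range(num_layers):
        visual = cross_attend(visual, text)
    return visual


if __name__ == "__main__":
    visual_tokens = [[1.0, 0.0], [0.5, 0.5]]
    text_tokens = [[0.0, 1.0], [1.0, 0.0]]
    print(fuse_layer_by_layer(visual_tokens, text_tokens, num_layers=2))
```

Repeating the fusion at every layer, rather than once at the end, means later layers operate on visual features that already carry language context, which is one plausible reading of the "incremental" fusion the abstract contrasts with loose feature fusion.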
Taxonomy
Topics: Robotics and Automated Systems · Target Tracking and Data Fusion in Sensor Networks · Speech and dialogue systems
Methods: Dropout · Layer Normalization · Adam · Attention Is All You Need · Dense Connections · Residual Connection · Position-Wise Feed-Forward Layer · Linear Layer · Byte Pair Encoding · Absolute Position Encodings
