MixFormer: End-to-End Tracking with Iterative Mixed Attention
Yutao Cui, Cheng Jiang, Gangshan Wu, Limin Wang

TL;DR
MixFormer is a unified transformer-based framework for visual object tracking. Its Mixed Attention Module performs feature extraction and target information integration simultaneously, and the tracker achieves state-of-the-art results on seven tracking benchmarks.
Contribution
The paper proposes a novel Mixed Attention Module (MAM) within a transformer-based framework, unifying feature extraction and target integration for improved tracking performance.
Findings
Achieved new state-of-the-art results on seven tracking benchmarks.
Demonstrated effectiveness of pre-training methods, including masked pre-training.
Developed an efficient asymmetric attention scheme for multiple target templates.
Abstract
Visual object tracking often employs a multi-stage pipeline of feature extraction, target information integration, and bounding box estimation. To simplify this pipeline and unify feature extraction and target information integration, we present a compact tracking framework, termed MixFormer, built upon transformers. Our core design exploits the flexibility of attention operations: we propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration. This synchronous modeling scheme makes it possible to extract target-specific discriminative features and to perform extensive communication between the target and the search area. Based on MAM, we build our MixFormer trackers simply by stacking multiple MAMs and placing a localization head on top. Specifically, we instantiate two types of MixFormer trackers, a hierarchical…
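The idea of mixing target and search tokens inside one attention operation can be illustrated with a minimal sketch. This is not the authors' implementation: it omits the learned query/key/value projections and the multi-head structure a real transformer layer would have, and the function names (`attend`, `mixed_attention`) and the `asymmetric` flag are hypothetical. The asymmetric branch reflects one reading of the asymmetric scheme mentioned in the findings, in which target queries attend only to target tokens while search queries still attend to both.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def mixed_attention(target, search, asymmetric=False):
    """Attention over the concatenation of target and search tokens.

    Assumption: identity Q/K/V projections, single head.
    """
    mixed = target + search  # one joint token sequence
    if asymmetric:
        # Assumed asymmetric variant: target tokens ignore the search area,
        # so their output does not depend on the current frame.
        new_target = attend(target, target, target)
    else:
        new_target = attend(target, mixed, mixed)
    # Search tokens always see both themselves and the target.
    new_search = attend(search, mixed, mixed)
    return new_target, new_search


target = [[1.0, 0.0], [0.0, 1.0]]
search = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]
new_target, new_search = mixed_attention(target, search, asymmetric=True)
```

Under this reading, the target output in the asymmetric case is independent of the search tokens, which is what would let target template features be computed once and reused across frames.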
Taxonomy
Topics: Video Surveillance and Tracking Methods · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques
