Track Anything Rapter (TAR)
Tharun V. Puthanveettil, Fnu Obaid ur Rahman

TL;DR
This paper presents TAR, an advanced UAV tracking system that integrates pre-trained models and multimodal queries for precise object tracking in various scenarios, validated against ground truth and tested with multiple modalities.
Contribution
Develops TAR, a novel UAV tracking system combining pre-trained models and multimodal queries for improved object detection and tracking.
Findings
TAR achieves stable and precise tracking on a custom drone.
The system effectively handles occlusions using foundational models.
Multi-modality support enhances tracking versatility.
Abstract
Object tracking is a fundamental task in computer vision with broad practical applications across various domains, including traffic monitoring, robotics, and autonomous vehicle tracking. In this project, we aim to develop a sophisticated aerial vehicle system known as Track Anything Rapter (TAR), designed to detect, segment, and track objects of interest based on user-provided multimodal queries, such as text, images, and clicks. TAR utilizes cutting-edge pre-trained models like DINO, CLIP, and SAM to estimate the relative pose of the queried object. The tracking problem is approached as a Visual Servoing task, enabling the UAV to consistently focus on the object through advanced motion planning and control algorithms. We showcase how the integration of these foundational models with a custom high-level control algorithm results in a highly stable and precise tracking system deployed…
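The abstract frames tracking as a Visual Servoing task but does not spell out the control law. As a rough illustration only, the sketch below shows one common image-based form: a proportional controller that turns the tracked object's bounding box into velocity commands that keep it centered and at a roughly constant apparent size. The `BoundingBox` type, the `servo_command` function, the gains, and the target area ratio are all hypothetical names and values chosen for this example; they are not taken from the paper and do not represent the authors' high-level controller.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    """Axis-aligned box in pixel coordinates (hypothetical tracker output)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def servo_command(box, image_width, image_height,
                  target_area_ratio=0.1,
                  k_yaw=1.0, k_vertical=0.5, k_forward=2.0):
    """Map a tracked box to (yaw_rate, vertical_velocity, forward_velocity).

    The gains and target_area_ratio are illustrative assumptions, not
    values reported in the paper.
    """
    # Offset of the box center from the image center, normalized to [-1, 1].
    cx = (box.x_min + box.x_max) / 2.0
    cy = (box.y_min + box.y_max) / 2.0
    err_x = (cx - image_width / 2.0) / (image_width / 2.0)
    err_y = (cy - image_height / 2.0) / (image_height / 2.0)

    # Fraction of the image covered by the box, a crude proxy for distance.
    area = (box.x_max - box.x_min) * (box.y_max - box.y_min)
    area_ratio = area / (image_width * image_height)
    err_area = target_area_ratio - area_ratio

    yaw_rate = k_yaw * err_x                   # turn toward the object
    vertical_velocity = -k_vertical * err_y    # image y grows downward
    forward_velocity = k_forward * err_area    # approach if the object looks small
    return yaw_rate, vertical_velocity, forward_velocity


# A box that is centered and already at the target size needs no correction.
centered = BoundingBox(224, 160, 416, 320)
print(servo_command(centered, 640, 480))

# A box on the right edge of the frame asks for a positive yaw rate.
right = BoundingBox(480, 160, 640, 320)
print(servo_command(right, 640, 480))
```

In a real system the box would come from the segmentation and tracking stage, and the commands would pass through the vehicle's own velocity limits and lower-level controller before reaching the motors.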
Taxonomy
Topics: Video Surveillance and Tracking Methods · UAV Applications and Optimization · Advanced Neural Network Applications
Methods: Attention Is All You Need · Linear Layer · Softmax · Layer Normalization · Multi-Head Attention · Dense Connections · Residual Connection · Vision Transformer · Focus · Contrastive Language-Image Pre-training
