Leveraging Transformers for Weakly Supervised Object Localization in Unconstrained Videos
Shakeeb Murtaza, Marco Pedersoli, Aydin Sarraf, Eric Granger

TL;DR
This paper introduces TrCAM-V, a transformer-based approach for weakly-supervised object localization in videos, utilizing pseudo-labels from CLIP and CRF loss to improve accuracy without bounding box annotations.
Contribution
TrCAM-V combines a transformer architecture with pseudo-labels and a CRF loss for weakly-supervised video object localization.
Findings
Achieves state-of-the-art localization accuracy on the YouTube-Objects dataset.
Effectively uses pseudo-labels from CLIP for training without bounding boxes.
Processes frames in real time during inference.
Abstract
Weakly-Supervised Video Object Localization (WSVOL) involves localizing an object in videos using only video-level labels, also referred to as tags. State-of-the-art WSVOL methods like Temporal CAM (TCAM) rely on class activation mapping (CAM) and typically require a pre-trained CNN classifier. However, their localization accuracy is affected by their tendency to minimize the mutual information between different instances of a class and exploit temporal information during training for downstream tasks, e.g., detection and tracking. In the absence of bounding box annotation, it is challenging to exploit precise information about objects from temporal cues because the model struggles to locate objects over time. To address these issues, a novel method called transformer-based CAM for videos (TrCAM-V) is proposed for WSVOL. It consists of a DeiT backbone with two heads for classification…
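The abstract refers to class activation mapping (CAM) without spelling it out. As a rough illustration only, the sketch below shows the generic CAM idea: a weighted sum of feature maps using one class's classifier weights, min-max normalized and then thresholded into a bounding box. It is not the TrCAM-V method. The function names, the 0.5 threshold, and the toy inputs are all assumptions made for this example.

```python
# Generic CAM sketch (illustrative assumption, not the paper's TrCAM-V pipeline).

def class_activation_map(features, class_weights):
    """Weighted sum of feature maps, min-max normalized to [0, 1].

    features: list of C feature maps, each an H x W nested list of floats.
    class_weights: list of C classifier weights for the class of interest.
    """
    height, width = len(features[0]), len(features[0][0])
    cam = [
        [sum(w * fmap[y][x] for w, fmap in zip(class_weights, features))
         for x in range(width)]
        for y in range(height)
    ]
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    if hi == lo:
        # Flat map: no location stands out, so return all zeros.
        return [[0.0] * width for _ in range(height)]
    return [[(v - lo) / (hi - lo) for v in row] for row in cam]


def cam_to_box(cam, threshold=0.5):
    """Tightest (x_min, y_min, x_max, y_max) box around activations >= threshold."""
    hits = [(x, y) for y, row in enumerate(cam)
            for x, v in enumerate(row) if v >= threshold]
    if not hits:
        return None
    xs = [x for x, _ in hits]
    ys = [y for _, y in hits]
    return (min(xs), min(ys), max(xs), max(ys))


# Toy input (hypothetical): channel 0 fires on a 2x2 object, channel 1 on a distractor.
toy_features = [
    [[0.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 1.0, 0.0],
     [0.0, 1.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]],
    [[1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]],
]
toy_weights = [1.0, 0.0]  # the class attends to channel 0 only
toy_cam = class_activation_map(toy_features, toy_weights)
toy_box = cam_to_box(toy_cam)
print(toy_box)
```

In this toy case the distractor channel has zero weight for the chosen class, so the box covers only the 2x2 object region. A real pipeline would take the feature maps and weights from a trained classifier and upsample the map to the frame resolution before thresholding.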
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis
Methods: Attention Is All You Need · Dense Connections · Contrastive Language-Image Pre-training · Softmax · Feedforward Network · Class-activation map · ALIGN · Linear Layer · Attention Dropout · Dropout
