Leveraging Transformers for Weakly Supervised Object Localization in Unconstrained Videos

Shakeeb Murtaza; Marco Pedersoli; Aydin Sarraf; Eric Granger

arXiv:2407.06018·cs.CV·July 9, 2024

TL;DR

This paper introduces TrCAM-V, a transformer-based method for weakly supervised object localization in videos. It is trained with pseudo-labels derived from CLIP and a CRF loss, improving localization accuracy without bounding box annotations.

Contribution

TrCAM-V combines a transformer architecture with CLIP-derived pseudo-labels and a CRF loss for weakly supervised video object localization.

Findings

01. Achieves state-of-the-art localization accuracy on the YouTube-Objects dataset.

02. Trains on pseudo-labels from CLIP, with no bounding box annotations.

03. Processes frames in real time during inference.

Abstract

Weakly-Supervised Video Object Localization (WSVOL) involves localizing an object in videos using only video-level labels, also referred to as tags. State-of-the-art WSVOL methods like Temporal CAM (TCAM) rely on class activation mapping (CAM) and typically require a pre-trained CNN classifier. However, their localization accuracy is affected by their tendency to minimize the mutual information between different instances of a class and exploit temporal information during training for downstream tasks, e.g., detection and tracking. In the absence of bounding box annotations, it is challenging to exploit precise information about objects from temporal cues because the model struggles to locate objects over time. To address these issues, a novel method called Transformer-based CAM for videos (TrCAM-V) is proposed for WSVOL. It consists of a DeiT backbone with two heads for classification…
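
The abstract refers to class activation mapping (CAM) as the basis of existing WSVOL methods. As a rough illustration of that generic idea only, and not of TrCAM-V or TCAM themselves, the sketch below builds a CAM as a weighted sum of feature maps, min-max normalizes it, and takes the tightest box around the cells above a threshold. The function names, the toy feature maps, the class weights, and the 0.5 threshold are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

Grid = List[List[float]]


def class_activation_map(features: List[Grid], class_weights: List[float]) -> Grid:
    """Weighted sum of K feature maps using one class's classifier weights."""
    height, width = len(features[0]), len(features[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, w in zip(features, class_weights):
        for i in range(height):
            for j in range(width):
                cam[i][j] += w * fmap[i][j]
    return cam


def normalize(cam: Grid) -> Grid:
    """Min-max scale to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in cam for v in row]
    low, high = min(flat), max(flat)
    span = high - low
    if span == 0:
        return [[0.0 for _ in row] for row in cam]
    return [[(v - low) / span for v in row] for row in cam]


def bounding_box(cam: Grid, threshold: float = 0.5) -> Optional[Tuple[int, int, int, int]]:
    """Tightest (row_min, col_min, row_max, col_max) around cells >= threshold."""
    hits = [(i, j) for i, row in enumerate(cam) for j, v in enumerate(row) if v >= threshold]
    if not hits:
        return None
    rows = [i for i, _ in hits]
    cols = [j for _, j in hits]
    return (min(rows), min(cols), max(rows), max(cols))


# Toy input (hypothetical): two 3x3 feature maps and weights for one class.
features = [
    [[0.0, 0.0, 0.0], [0.0, 1.0, 1.0], [0.0, 1.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
]
class_weights = [1.0, 0.2]

cam = normalize(class_activation_map(features, class_weights))
box = bounding_box(cam, threshold=0.5)
print(box)  # (1, 1, 2, 2)
```

In this toy case the weakly activated top-left cell (0.2 after normalization) falls below the threshold, so the box covers only the strongly activated 2x2 region. Only a class label is needed to pick the weights, which is why CAM-style localization fits the weakly supervised setting.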

Code & Models

Repositories

shakeebmurtaza/TrCAM (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis

Methods: Attention Is All You Need · Dense Connections · Contrastive Language-Image Pre-training · Softmax · Feedforward Network · Class-activation map · ALIGN · Linear Layer · Attention Dropout · Dropout