TeD-Loc: Text Distillation for Weakly Supervised Object Localization
Shakeeb Murtaza, Soufiane Belharbi, Alexis Guichemerre, Marco Pedersoli, Eric Granger

TL;DR
TeD-Loc distills knowledge from CLIP text embeddings into patch embeddings through contrastive alignment, improving weakly supervised object localization accuracy while running inference more efficiently than GenPrompt.
Contribution
TeD-Loc proposes a contrastive knowledge transfer approach that carries information from CLIP text embeddings into patch embeddings, so that individual patches can be used for localization.
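The idea of aligning patch embeddings with class text embeddings can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the temperature value of 0.07, and the boolean `fg_mask` that selects which patches count as foreground are all hypothetical, since this summary does not say how TeD-Loc chooses patches or defines its loss. The sketch only shows the general pattern of a contrastive (softmax cross-entropy) objective between patches and text embeddings, plus a per-patch similarity map.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def patch_text_logits(patches, text_emb, temperature=0.07):
    # patches: (N, D) patch embeddings; text_emb: (C, D) class text embeddings.
    # Returns (N, C) cosine similarities divided by an assumed temperature.
    return l2_normalize(patches) @ l2_normalize(text_emb).T / temperature

def contrastive_patch_loss(patches, text_emb, label, fg_mask, temperature=0.07):
    # Cross-entropy that pulls the selected patches toward the text embedding
    # of class `label`, using the other class embeddings as negatives.
    # fg_mask is a hypothetical boolean selection of foreground patches and
    # must contain at least one True entry.
    logits = patch_text_logits(patches, text_emb, temperature)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[fg_mask, label].mean()

def text_similarity_map(patches, text_emb, label, grid_hw):
    # Reshape per-patch similarity to class `label` into an (H, W) map.
    return patch_text_logits(patches, text_emb)[:, label].reshape(grid_hw)
```

With patches that already point in the direction of the correct class embedding, the loss is close to zero; patches aligned with a different class give a large loss, which is the signal a training procedure of this kind would minimize.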
Findings
TeD-Loc improves Top-1 localization accuracy (Top-1 Loc) by ~5% on the CUB and ILSVRC datasets.
It increases pixel-wise average precision (PxAP) by ~31% on histopathology benchmarks.
TeD-Loc achieves more efficient inference than GenPrompt.
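For readers unfamiliar with the Top-1 Loc metric, the sketch below follows the definition commonly used in the WSOL literature: a prediction counts as correct when the predicted class matches the ground truth and the predicted box overlaps the ground-truth box with IoU of at least 0.5. The summary does not spell out the exact evaluation protocol, so the threshold, the single ground-truth box per image, the continuous-coordinate box format `(x1, y1, x2, y2)`, and the function names are assumptions made for illustration.

```python
def box_iou(a, b):
    # Boxes are (x1, y1, x2, y2) with x2 > x1 and y2 > y1 (assumed format).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def top1_loc(pred_classes, pred_boxes, gt_classes, gt_boxes, iou_thresh=0.5):
    # Fraction of images where the top-1 class is right AND the box is
    # localized well enough (IoU >= iou_thresh).
    hits = [
        pc == gc and box_iou(pb, gb) >= iou_thresh
        for pc, pb, gc, gb in zip(pred_classes, pred_boxes, gt_classes, gt_boxes)
    ]
    return sum(hits) / len(hits)
```

Because both conditions must hold, a model can classify an image correctly and still miss under this metric if its box covers only the most discriminative part of the object.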
Abstract
Weakly supervised object localization (WSOL) models are trained using only image-level class labels. They can predict both the object class and spatial regions corresponding to the object, without requiring explicit bounding box annotations. Given their reliance on classification objectives, traditional WSOL methods, like class activation mapping, tend to focus on the most discriminative object regions, often missing the full spatial extent. Although vision-language models such as CLIP encode rich semantic priors, they are not directly suited for WSOL because global text and class-token embeddings are not explicitly aligned with local patch embeddings, making patch-level localization difficult without additional mechanisms. Recent methods such as GenPrompt address this limitation, but at the cost of increased complexity, as they rely on conditional denoising and elaborate…
