TriLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement
Arian Sabaghi, José Oramas

TL;DR
TriLite introduces a single-stage, parameter-efficient weakly supervised object localization framework leveraging a frozen Vision Transformer with a novel TriHead module for improved object coverage and state-of-the-art results.
Contribution
Proposes TriLite, a WSOL method that trains only a small number of parameters on top of a frozen backbone and disentangles patch features into foreground, background, and ambiguous regions, improving object localization without full fine-tuning.
Findings
Sets new state-of-the-art on multiple datasets
Uses fewer than 800K trainable parameters
Easier to train than prior WSOL methods
Abstract
Weakly supervised object localization (WSOL) aims to localize target objects in images using only image-level labels. Despite recent progress, many approaches still rely on multi-stage pipelines or full fine-tuning of large backbones, which increases training cost, and partial object coverage remains an open problem across WSOL methods. We present TriLite, a single-stage WSOL framework that leverages a frozen Vision Transformer with self-supervised DINOv2 pre-training and introduces only a minimal number of trainable parameters (fewer than 800K on ImageNet-1K) for both classification and localization. At its core is the proposed TriHead module, which decomposes patch features into foreground, background, and ambiguous regions, thereby improving object coverage while suppressing spurious activations. By disentangling classification and localization…
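The abstract does not give the internals of TriHead, so the following is only a rough sketch of the general idea it describes: a small trainable head that assigns each patch feature from a frozen backbone to one of three regions (foreground, background, ambiguous) and classifies from the foreground-weighted features. Everything here is an assumption made for illustration: the class name TriRegionHeadSketch, the linear three-way projection, the softmax over regions, the foreground-weighted pooling, the 0.5 box threshold, and the random array standing in for DINOv2 patch features. It is not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class TriRegionHeadSketch:
    """Hypothetical three-region head over frozen patch features.

    Assumed layout (not taken from the paper): one linear map to three
    region logits per patch, plus one linear classifier on pooled features.
    """

    def __init__(self, dim, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        self.w_region = rng.normal(0.0, 0.02, (dim, 3))
        self.b_region = np.zeros(3)
        self.w_cls = rng.normal(0.0, 0.02, (dim, num_classes))
        self.b_cls = np.zeros(num_classes)

    def num_params(self):
        params = (self.w_region, self.b_region, self.w_cls, self.b_cls)
        return sum(p.size for p in params)

    def forward(self, patches):
        """patches: (num_patches, dim) array from a frozen backbone."""
        # Per-patch probabilities over (foreground, background, ambiguous).
        regions = softmax(patches @ self.w_region + self.b_region, axis=-1)
        fg = regions[:, 0]
        # Pool patch features weighted by foreground probability (assumed).
        pooled = (fg[:, None] * patches).sum(axis=0) / (fg.sum() + 1e-8)
        logits = pooled @ self.w_cls + self.b_cls
        return regions, logits


def fg_bounding_box(fg_map, threshold=0.5):
    """Box (x_min, y_min, x_max, y_max) in patch coordinates, or None.

    The threshold value is an arbitrary choice for this sketch.
    """
    ys, xs = np.nonzero(fg_map >= threshold)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


# Random stand-in for frozen backbone output: a 16x16 patch grid, 768-dim.
grid, dim = 16, 768
patches = np.random.default_rng(0).normal(size=(grid * grid, dim))
head = TriRegionHeadSketch(dim, num_classes=1000)
regions, logits = head.forward(patches)
fg_map = regions[:, 0].reshape(grid, grid)
```

With the assumed 768-dimensional features and 1000 classes, head.num_params() in this sketch is 771,307. That number is arithmetic on the assumed layout only and should not be read as the paper's parameter breakdown.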
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Face recognition and analysis
