Semantic-Constraint Matching Transformer for Weakly Supervised Object Localization
Yiwen Cao, Yukun Su, Wenjun Wang, Yanxia Liu, Qingyao Wu

TL;DR
This paper introduces a Semantic-Constraint Matching Network using a transformer and a local patch shuffle strategy to improve weakly supervised object localization, achieving state-of-the-art results by addressing divergent activation issues.
Contribution
The paper proposes a novel transformer-based network with a semantic-constraint matching module and a local patch shuffle strategy to enhance object localization accuracy in weakly supervised settings.
Findings
Achieves new state-of-the-art performance on CUB-200-2011 and ILSVRC datasets.
Outperforms previous methods by a large margin.
Effectively mitigates divergent activation in transformer-based WSOL.
Abstract
Weakly supervised object localization (WSOL) strives to learn to localize objects with only image-level supervision. Due to the local receptive fields generated by convolution operations, previous CNN-based methods suffer from partial activation issues, concentrating on the object's discriminative part instead of the entire entity scope. Benefiting from the capability of the self-attention mechanism to acquire long-range feature dependencies, Vision Transformer has been recently applied to alleviate the local activation drawbacks. However, since the transformer lacks the inductive localization bias that is inherent in CNNs, it may cause a divergent activation problem, resulting in an uncertain distinction between foreground and background. In this work, we propose a novel Semantic-Constraint Matching Network (SCMN) via a transformer to converge on the divergent activation.…
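To make the "divergent activation" problem concrete, the sketch below shows one common way a WSOL pipeline turns an activation map into a bounding box: min-max normalize the map, threshold it, and take the tight box around the surviving cells. This is an illustrative simplification and not the SCMN method itself; the function name `activation_to_box`, the `rel_threshold` parameter, and the toy 8x8 maps are all assumptions introduced here. Practical pipelines often also keep only the largest connected component, which this sketch omits.

```python
import numpy as np


def activation_to_box(act_map, rel_threshold=0.5):
    """Return (x_min, y_min, x_max, y_max) around cells above the threshold.

    Hypothetical helper for illustration only; not taken from the paper.
    Returns None when the map is constant and carries no localization signal.
    """
    act = np.asarray(act_map, dtype=float)
    span = act.max() - act.min()
    if span == 0:
        return None
    # Min-max normalize to [0, 1] so the threshold is relative to the peak.
    norm = (act - act.min()) / span
    ys, xs = np.nonzero(norm >= rel_threshold)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


# A compact activation blob centred on the (toy) object.
focused = np.zeros((8, 8))
focused[2:5, 3:6] = 1.0

# The same map with one strong spurious response in the background,
# mimicking a divergent activation.
divergent = focused.copy()
divergent[7, 0] = 0.8

print(activation_to_box(focused))    # (3, 2, 5, 4)
print(activation_to_box(divergent))  # (0, 2, 5, 7)
```

A single stray background response is enough to stretch the predicted box well beyond the object, which is why suppressing divergent activation matters for localization accuracy.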
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Byte Pair Encoding · Label Smoothing · Dropout · Absolute Position Encodings · Layer Normalization · Adam
