TL;DR
This paper introduces a two-stage vision transformer framework with learned binary masks to improve object recognition robustness by focusing on relevant regions and filtering out background biases.
Contribution
It proposes a novel two-stage attention masking approach that enhances robustness and interpretability in object recognition tasks.
Findings
Significant robustness improvements against spurious correlations.
Effective filtering of out-of-distribution backgrounds.
Enhanced model interpretability through explicit semantic masks.
Abstract
Context can strongly affect object representations, sometimes leading to undesired biases, particularly when objects appear in out-of-distribution backgrounds at inference. At the same time, many object-centric tasks require leveraging context to identify the relevant image regions. We posit that this conundrum, in which context is simultaneously needed and a potential nuisance, can be addressed by an attention-based approach that uses learned binary attention masks to ensure that only attended image regions influence the prediction. To test this hypothesis, we evaluate a two-stage framework: stage 1 processes the full image to discover object parts and identify task-relevant regions, for which context cues are likely to be needed, while stage 2 leverages input attention masking to restrict its receptive field to these regions, enabling a focused analysis while filtering out…
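The idea that a binary mask lets only attended regions influence the prediction can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `masked_attention`, the single-head attention, the toy 2-dimensional patch embeddings, and the hand-written mask are all assumptions made for illustration. In the paper the mask is learned by stage 1, whereas here it is fixed by hand.

```python
import math

def masked_attention(queries, keys, values, keep):
    """Single-head scaled dot-product attention with a binary key mask.

    Hypothetical helper for illustration: tokens with keep[i] == False
    get a score of -inf, so their softmax weight is exactly zero.
    """
    if not any(keep):
        raise ValueError("mask must keep at least one token")
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) if m else -math.inf
            for k, m in zip(keys, keep)
        ]
        top = max(scores)  # finite, because at least one token is kept
        weights = [math.exp(s - top) for s in scores]  # exp(-inf) == 0.0
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Toy "image": four patch embeddings; the last two stand in for background.
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0]]
keep = [True, True, False, False]  # assumed stage-1 mask (hand-written here)

# Stage-2 analogue: only kept patches act as queries.
fg_queries = [p for p, m in zip(patches, keep) if m]
out_original = masked_attention(fg_queries, patches, patches, keep)

# Replace the background patches with different content.
swapped = patches[:2] + [[9.0, 9.0], [-7.0, 3.0]]
out_swapped = masked_attention(fg_queries, swapped, swapped, keep)

# With the mask applied, the background swap should not affect the output.
print(out_original == out_swapped)
```

Masking the scores before the softmax, rather than zeroing the outputs afterwards, is what keeps the masked patches from contributing anything to the attended representation in this sketch.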
