Two-stage Vision Transformers and Hard Masking offer Robust Object Representations

Ananthu Aniraj; Cassio F. Dantas; Dino Ienco; Diego Marcos

arXiv:2506.08915·cs.CV·April 2, 2026

TL;DR

This paper introduces a two-stage vision transformer framework with learned binary masks to improve object recognition robustness by focusing on relevant regions and filtering out background biases.

Contribution

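The paper proposes a two-stage attention masking approach: stage 1 identifies task-relevant regions from the full image, and stage 2 restricts its receptive field to those regions, improving robustness and interpretability in object recognition.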

Findings

01 Significant robustness improvements against spurious correlations.

02 Effective filtering of out-of-distribution backgrounds.

03 Enhanced model interpretability through explicit semantic masks.

Abstract

Context can strongly affect object representations, sometimes leading to undesired biases, particularly when objects appear in out-of-distribution backgrounds at inference. At the same time, many object-centric tasks require leveraging context to identify the relevant image regions. We posit that this conundrum, in which context is simultaneously needed and a potential nuisance, can be addressed by an attention-based approach that uses learned binary attention masks to ensure that only attended image regions influence the prediction. To test this hypothesis, we evaluate a two-stage framework: stage 1 processes the full image to discover object parts and identify task-relevant regions, for which context cues are likely to be needed, while stage 2 leverages input attention masking to restrict its receptive field to these regions, enabling a focused analysis while filtering out…
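The input attention masking described for stage 2 can be illustrated with a minimal sketch. This is not the authors' implementation: it is a single-head self-attention layer in NumPy, and the function name `masked_self_attention`, the weight matrices, and the hand-written `keep` mask are all hypothetical. In the paper the binary mask is learned by stage 1; here it is simply assumed to be given, one boolean per patch token.

```python
import numpy as np

def masked_self_attention(tokens, keep, w_q, w_k, w_v):
    """Single-head self-attention in which only tokens with keep=True
    can be attended to (hypothetical helper, for illustration only)."""
    if not keep.any():
        raise ValueError("the mask must keep at least one token")
    q = tokens @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Hard masking: masked-out keys receive -inf, so their softmax weight is exactly 0.
    scores = np.where(keep[None, :], scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
num_tokens, dim = 6, 8
tokens = rng.normal(size=(num_tokens, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

# Assumed stage-1 output: a binary mask marking which patch tokens are task-relevant.
keep = np.array([True, True, False, True, False, False])

out_a = masked_self_attention(tokens, keep, w_q, w_k, w_v)

# Replace the content of the masked-out ("background") tokens with new random values.
perturbed = tokens.copy()
perturbed[~keep] = rng.normal(size=(int((~keep).sum()), dim))
out_b = masked_self_attention(perturbed, keep, w_q, w_k, w_v)

# Outputs at the kept positions do not depend on the masked-out tokens.
print(np.allclose(out_a[keep], out_b[keep]))
```

Because the masked keys get zero attention weight, changing the background tokens leaves the outputs at the kept positions unchanged in this sketch, which is the property the abstract refers to as ensuring that only attended image regions influence the prediction.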

Code & Models

Repositories

ananthu-aniraj/ifam
github
