ASCENT-ViT: Attention-based Scale-aware Concept Learning Framework for Enhanced Alignment in Vision Transformers

Sanchit Sinha; Guangzhi Xiong; Aidong Zhang

arXiv:2501.09221 · cs.CV · February 5, 2025

TL;DR

ASCENT-ViT introduces an attention-based, scale-aware concept learning framework for Vision Transformers, enhancing interpretability and predictive accuracy by aligning multiscale features with human-understandable concepts.

Contribution

The paper proposes a scale- and position-aware concept learning method that integrates with ViTs, improving interpretability and performance over existing generic explainability modules.

Findings

01. Improves predictive accuracy on multiple datasets

02. Provides accurate, robust concept explanations

03. Enhances interpretability of Vision Transformers

Abstract

As Vision Transformers (ViTs) are increasingly adopted in sensitive vision applications, there is a growing demand for improved interpretability. This has led to efforts to forward-align these models with carefully annotated, abstract, human-understandable semantic entities called concepts. Concepts provide global rationales for the model's predictions and can be quickly understood and intervened on by domain experts. Most current research focuses on designing model-agnostic, plug-and-play, generic concept-based explainability modules that do not incorporate the inner workings of foundation models (e.g., inductive biases, scale invariance, etc.) during training. To alleviate this issue for ViTs, in this paper, we propose ASCENT-ViT, an attention-based concept learning framework that effectively composes scale and position-aware representations from multiscale feature pyramids and ViT patch…
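
The abstract is cut off before it describes the architecture, so the following is only a generic illustration of the kind of attention step it alludes to, not the authors' method. It is a minimal sketch, assuming that feature vectors from several scales and ViT patch tokens are concatenated into one token list and that each concept has a query vector; the function name concept_attention, the toy inputs, and the omission of position encodings and learned projections are all assumptions made for illustration.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def concept_attention(concept_queries, tokens):
    """Scaled dot-product attention: each (hypothetical) concept query
    attends over every token and returns a weighted average of them."""
    d = len(tokens[0])
    outputs, weights = [], []
    for q in concept_queries:
        scores = [dot(q, t) / math.sqrt(d) for t in tokens]
        w = softmax(scores)
        pooled = [sum(wi * t[j] for wi, t in zip(w, tokens)) for j in range(d)]
        outputs.append(pooled)
        weights.append(w)
    return outputs, weights

# Hypothetical toy inputs: 2-dimensional features from two scales plus ViT patches.
coarse_scale = [[1.0, 0.0], [0.0, 1.0]]
fine_scale = [[0.5, 0.5]]
vit_patches = [[1.0, 1.0], [0.0, 0.0]]
tokens = coarse_scale + fine_scale + vit_patches

concept_queries = [[1.0, 0.0], [0.0, 1.0]]  # one query vector per concept
concept_repr, attn = concept_attention(concept_queries, tokens)
```

Each row of attn is a probability distribution over the five tokens, so it can be read as how strongly a concept draws on each scale or patch. A real model would learn the queries and project the tokens before attending.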

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization · Robotics and Automated Systems

Methods: Softmax · Attention Is All You Need