Empowering Vision Transformers with Multi-Scale Causal Intervention for Long-Tailed Image Classification

Xiaoshuo Yan; Zhaochuan Li; Lei Meng; Zhuang Qi; Wei Wu; Zixuan Li; Xiangxu Meng

arXiv:2505.08173·cs.CV·May 14, 2025

TL;DR

This paper introduces TSCNet, a two-stage causal modeling approach that improves long-tailed image classification by applying multi-scale causal interventions to address biases in Vision Transformers.

Contribution

The paper proposes TSCNet, a two-stage causal modeling framework that improves tail-class classification in Vision Transformers through fine-grained causal association discovery and bias calibration.

Findings

01 TSCNet outperforms existing methods on long-tail benchmarks.

02 The hierarchical causal representation learning improves fine-grained feature modeling.

03 Counterfactual bias calibration reduces spurious associations in logits.
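
To make the idea of calibrating logits against a counterfactual concrete, here is a minimal sketch. It is not TSCNet's actual procedure, which this page does not describe; it assumes a generic subtraction-style debiasing step in which the logits produced from a content-free (counterfactual) input are treated as the model's class bias and removed from the factual logits. The function names `calibrate_logits` and `argmax`, the scaling factor `alpha`, and all numbers are hypothetical.

```python
from typing import List


def calibrate_logits(
    factual: List[float],
    counterfactual: List[float],
    alpha: float = 1.0,
) -> List[float]:
    """Subtract scaled counterfactual (bias-only) logits from factual logits.

    Assumption: `counterfactual` holds the logits the model emits when the
    image content is removed, so it approximates the class-imbalance bias.
    """
    if len(factual) != len(counterfactual):
        raise ValueError("logit vectors must have the same length")
    return [f - alpha * c for f, c in zip(factual, counterfactual)]


def argmax(values: List[float]) -> int:
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy example (made-up numbers): class 0 is a head class, class 2 a tail class.
factual = [2.0, 0.5, 1.75]
counterfactual = [1.5, 0.25, 0.25]  # the head class is favoured even without content

calibrated = calibrate_logits(factual, counterfactual, alpha=1.0)

print(argmax(factual))     # predicted class before calibration
print(argmax(calibrated))  # predicted class after calibration
```

In this toy case the head class wins on the raw logits mainly because of its bias term, and the tail class wins once that term is subtracted. The choice of `alpha` controls how aggressively the bias is removed and would need tuning in practice.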

Abstract

Causal inference has emerged as a promising approach to long-tailed classification because it can mitigate the biases introduced by class imbalance. However, as advanced backbone models shift from Convolutional Neural Networks (CNNs) to Vision Transformers (ViT), existing causal models may not achieve the expected performance gain. This paper investigates the influence of existing causal models on CNNs and ViT variants, highlighting that ViT's global feature representation makes it hard for causal methods to model associations between fine-grained features and predictions, which leads to difficulties in classifying tail classes with similar visual appearance. To address these issues, this paper proposes TSCNet, a two-stage causal modeling method that discovers fine-grained causal associations through multi-scale causal interventions. Specifically, in the hierarchical causal…
