Improving Semantic Segmentation in Transformers using Hierarchical Inter-Level Attention
Gary Leung, Jun Gao, Xiaohui Zeng, Sanja Fidler

TL;DR
This paper introduces Hierarchical Inter-Level Attention (HILA), a novel attention mechanism that enhances transformer-based image segmentation by enabling bidirectional feature updates across different levels, improving boundary localization and semantic understanding.
Contribution
HILA extends hierarchical vision transformers with local inter-level connections, allowing iterative bottom-up and top-down feature updates without altering the base architecture.
Findings
Improves semantic segmentation accuracy on benchmark datasets.
Reduces parameters and FLOPS compared to existing methods.
Easily integrates into popular hierarchical transformer architectures.
Abstract
Existing transformer-based image backbones typically propagate feature information in one direction, from lower to higher levels. This may not be ideal, since the localization ability needed to delineate accurate object boundaries is most prominent in the lower, high-resolution feature maps, while the semantics that can disambiguate image signals belonging to one object vs. another typically emerge at a higher level of processing. We present Hierarchical Inter-Level Attention (HILA), an attention-based method that captures Bottom-Up and Top-Down Updates between features of different levels. HILA extends hierarchical vision transformer architectures by adding local connections between features of higher and lower levels to the backbone encoder. In each iteration, we construct a hierarchy by having higher-level features compete for assignments to update lower-level features belonging to them,…
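The top-down step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation; it is one possible reading under stated assumptions: both levels share the same channel width, the higher level is downsampled by a fixed stride, each lower-level feature attends only to a small window around its parent location, and there are no learned projections. The function name `top_down_update` and the parameters `stride` and `radius` are hypothetical.

```python
import numpy as np

def top_down_update(low, high, stride=2, radius=1):
    """Illustrative top-down inter-level attention (assumed form, not the paper's code).

    low:  (H, W, C) lower-level, high-resolution features
    high: (H // stride, W // stride, C) higher-level features
    """
    H, W, C = low.shape
    Hh, Wh, _ = high.shape
    out = low.copy()
    for i in range(H):
        for j in range(W):
            # Parent location of this lower-level feature in the higher level.
            pi, pj = i // stride, j // stride
            # Local window of candidate higher-level features, clipped at borders.
            r0, r1 = max(pi - radius, 0), min(pi + radius + 1, Hh)
            c0, c1 = max(pj - radius, 0), min(pj + radius + 1, Wh)
            cand = high[r0:r1, c0:c1].reshape(-1, C)
            # Candidates compete through a softmax over the window.
            logits = cand @ low[i, j] / np.sqrt(C)
            weights = np.exp(logits - logits.max())
            weights /= weights.sum()
            # Residual update of the lower-level feature.
            out[i, j] = low[i, j] + weights @ cand
    return out

rng = np.random.default_rng(0)
low = rng.normal(size=(8, 8, 16))
high = rng.normal(size=(4, 4, 16))
updated = top_down_update(low, high)
print(updated.shape)  # (8, 8, 16)
```

The softmax runs over the higher-level candidates for each lower-level position, so the candidates compete for influence on that position. A bottom-up update could mirror this with the roles of the two levels swapped; the details of how the paper alternates and parameterizes the two directions are not given in the text above.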
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Adam · Dropout · Layer Normalization · Convolution
