Hilbert-Guided Sparse Local Attention

Yunge Li; Lanyu Xu

arXiv:2511.05832·cs.CV·February 13, 2026

Open Access 3 Reviews

TL;DR

This paper introduces a Hilbert curve-based method to reorder image tokens, significantly improving the efficiency of local attention mechanisms in high-resolution image processing with minimal accuracy loss.

Contribution

It proposes a novel Hilbert curve-based window construction for local attention, enhancing block sparsity and accelerating attention computations in vision transformers.

Findings

1. Achieves approximately 4x speedup for window attention.
2. Achieves approximately 18x speedup for slide attention.
3. Maintains accuracy with minimal loss in end-to-end transformer models.

Abstract

The quadratic compute and memory costs of global self-attention severely limit its use in high-resolution images. Local attention reduces complexity by restricting attention to neighborhoods. Block-sparse kernels can further improve the efficiency of local attention, but conventional local attention patterns often fail to deliver significant speedups because tokens within a window are not contiguous in the 1D sequence. This work proposes a novel method for constructing windows and neighborhoods based on the Hilbert curve. Image tokens are first reordered along a Hilbert curve, and windows and neighborhoods are then formed on the reordered 1D sequence. From a block-sparse perspective, this strategy significantly increases block sparsity and can be combined with existing block-sparse kernels to improve the efficiency of 2D local attention. Experiments show that the proposed Hilbert Window…
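The reordering idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the helper names (`hilbert_index`, `classify_blocks`), the 8x8 token grid, the window of 16 tokens, and the kernel block size of 16 are all assumptions chosen for illustration. The sketch orders tokens along a Hilbert curve, forms windows as consecutive runs of the reordered 1D sequence, and counts how many blocks of the attention mask are full, partial, or empty compared with conventional 4x4 windows on a row-major sequence.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve on an n x n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the canonical orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def classify_blocks(window_ids, block):
    """Count (full, partial, empty) blocks of the 'same window' attention mask.

    window_ids[i] is the window of the token at sequence position i; a query
    attends to a key only if both tokens share a window.
    """
    nblocks = len(window_ids) // block
    full = partial = empty = 0
    for bi in range(nblocks):
        rows = window_ids[bi * block:(bi + 1) * block]
        for bj in range(nblocks):
            cols = window_ids[bj * block:(bj + 1) * block]
            hits = sum(r == c for r in rows for c in cols)
            if hits == 0:
                empty += 1
            elif hits == block * block:
                full += 1
            else:
                partial += 1
    return full, partial, empty


n = 8          # assumed token grid: 8 x 8
w = 4          # assumed 2D window side for the row-major baseline
block = w * w  # assumed kernel block size, equal to the window length (16 tokens)

# Baseline: row-major token order with conventional 4 x 4 spatial windows.
row_major_ids = [(y // w) * (n // w) + (x // w) for y in range(n) for x in range(n)]

# Hilbert variant: sort tokens by curve position, then cut the 1D sequence
# into consecutive windows of 16 tokens.
hilbert_coords = sorted(
    ((x, y) for y in range(n) for x in range(n)),
    key=lambda p: hilbert_index(n, p[0], p[1]),
)
hilbert_ids = [i // block for i in range(n * n)]

print("row-major (full, partial, empty):", classify_blocks(row_major_ids, block))
print("hilbert   (full, partial, empty):", classify_blocks(hilbert_ids, block))
```

With these toy settings the row-major layout yields 0 full, 8 partial, and 8 empty blocks, while the Hilbert layout yields 4 full, 0 partial, and 12 empty blocks. A block-sparse kernel can skip empty blocks entirely, which is the effect the abstract describes as increased block sparsity. Here each run of 16 Hilbert-ordered tokens also happens to cover a 4x4 square, so spatial locality is preserved; for window lengths that are not powers of four, the windows would generally not be square.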

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. This paper provides a systematic and comprehensive framework for different kinds of local attention patterns. The authors identify a persistent gap between the theoretical and practical efficiency of sparse/local attention in vision Transformers, especially when block-sparse kernels are applied to row-major sequence orderings.
2. Although the Hilbert curve has already been used as a computational order in vision models such as Mamba, it is exciting to see local attention patterns unified with it.
3. The

Weaknesses

1. The range of applications evaluated in the paper is limited. Although Hilbert local attention has very promising potential for accelerating local attention in vision models, the authors only show results on image classification. Other tasks, covering both understanding and generation (e.g., object detection, semantic segmentation, image generation), should be included. On these tasks, more state-of-the-art efficient attention mechanisms should also be carefully discussed and compared.

Reviewer 02 · Rating 6 · Confidence 2

Strengths

1. Novelty: The idea of optimizing sequence order for block-sparse kernels is creative and addresses a key system-level bottleneck.
2. Generality: The method is model-agnostic and can be plugged into existing architectures (e.g., Swin, NAT) via programmable interfaces like FlexAttention.
3. Strong Empirical Results: Comprehensive experiments show significant speedups (e.g., 4x for HWA, 18x for HSA) and memory savings. End-to-end models validate practicality.

Weaknesses

1. The proposed methods are indeed interesting, but adaptability to non-square inputs or dynamic resolutions is not discussed.
2. Comparisons are mainly against unoptimized baselines. A deeper comparison with highly optimized kernels is needed.

Reviewer 03 · Rating 6 · Confidence 2

Strengths

- The Hilbert curve-based token reordering maintains 2D spatial locality while making tokens in windows/neighborhoods contiguous in the 1D sequence. This increases the ratio of empty blocks (reducing partial blocks) and maximizes the efficiency of block-sparse kernels, solving the core bottleneck of traditional row-major ordered local attention.
- Experimental results show significant speedups: HWA outperforms dense window attention by up to 4×, and HSA is 18× faster than conventional slide atte

Weaknesses

This paper proposes reordering the sequence of image patches fed to a transformer along a Hilbert curve, a simple construction that reduces the cost of the attention computation. The experiments are detailed, the narrative is adequate, and the writing shows few structural problems. However, in its starting point this paper is similar to Swin Transformer: both of them carried out technology-based innovatio

Taxonomy

Topics: Advanced Neural Network Applications · Image Enhancement Techniques · Generative Adversarial Networks and Image Synthesis