QUILL: An Algorithm-Architecture Co-Design for Cache-Local Deformable Attention

Hyunwoo Oh; Hanning Chen; Sanggeon Yun; Yang Ni; Wenjun Huang; Tamoghno Das; Suyeon Jang; Mohsen Imani

arXiv:2511.13679·cs.AR·November 18, 2025

TL;DR

QUILL is a specialized hardware accelerator for deformable transformers. It converts the irregular memory accesses of deformable attention into cache-friendly, single-pass work, and reports up to 7.29x higher throughput and 47.3x better energy efficiency than an RTX 4090.

Contribution

It introduces a schedule-aware architecture that pairs Distance-based Out-of-Order Querying (DOOQ) with a fused MSDeformAttn engine to execute deformable attention efficiently in hardware, outperforming existing accelerators.
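To make the idea of ordering queries by spatial proximity concrete, here is a minimal Python sketch. It is not the paper's DOOQ implementation: the greedy nearest-neighbour rule, the 4x4 tiling, and the names `dooq_order`, `tile_of`, and `prefetch_schedule` are all assumptions chosen for illustration. It only shows how a proximity-based order shortens the jumps between consecutive queries, and how looking one query ahead names the region to prefetch next.

```python
import math

def dooq_order(ref_points, start=0):
    """Greedy nearest-neighbour ordering of query reference points.

    Hypothetical stand-in for DOOQ: it illustrates "order queries by
    spatial proximity", not the rule used in the actual hardware.
    """
    remaining = set(range(len(ref_points)))
    remaining.discard(start)
    order = [start]
    while remaining:
        current = ref_points[order[-1]]
        # Closest unvisited query; ties are broken by index for determinism.
        nxt = min(remaining, key=lambda i: (math.dist(ref_points[i], current), i))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def path_length(ref_points, order):
    """Total distance travelled when visiting queries in the given order."""
    return sum(math.dist(ref_points[a], ref_points[b])
               for a, b in zip(order, order[1:]))

def tile_of(point, tiles_per_side=4):
    """Map a normalized (x, y) in [0, 1] to a tile index (assumed 4x4 grid)."""
    x, y = point
    clamp = tiles_per_side - 1
    return (min(int(x * tiles_per_side), clamp),
            min(int(y * tiles_per_side), clamp))

def prefetch_schedule(ref_points, order, tiles_per_side=4):
    """Pair each query with the tile of the NEXT query (the look-ahead)."""
    schedule = []
    for pos, q in enumerate(order):
        if pos + 1 < len(order):
            nxt_tile = tile_of(ref_points[order[pos + 1]], tiles_per_side)
        else:
            nxt_tile = None  # nothing left to prefetch
        schedule.append((q, nxt_tile))
    return schedule

# Made-up reference points: two clusters, interleaved in arrival order.
points = [(0.10, 0.10), (0.90, 0.90), (0.15, 0.12), (0.85, 0.95), (0.12, 0.20)]
order = dooq_order(points)
schedule = prefetch_schedule(points, order)
print(order)
print(path_length(points, list(range(len(points)))), path_length(points, order))
```

In this toy input the arrival order alternates between two far-apart clusters, while the proximity order finishes one cluster before moving to the other, so the prefetched tile changes only once.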

Findings

01. Up to 7.29x higher throughput than RTX 4090
02. 47.3x better energy efficiency
03. Maintains accuracy within 0.9 AP with quantization

Abstract

Deformable transformers deliver state-of-the-art detection but map poorly to hardware due to irregular memory access and low arithmetic intensity. We introduce QUILL, a schedule-aware accelerator that turns deformable attention into cache-friendly, single-pass work. At its core, Distance-based Out-of-Order Querying (DOOQ) orders queries by spatial proximity; the look-ahead drives a region prefetch into an alternate buffer, forming a schedule-aware prefetch loop that overlaps memory and compute. A fused MSDeformAttn engine executes interpolation, Softmax, aggregation, and the final projection (W''_m) in one pass without spilling intermediates, while small tensors are kept on-chip and surrounding dense layers run on integrated GEMMs. Implemented as RTL and evaluated end-to-end, QUILL achieves up to 7.29x higher throughput and 47.3x better energy efficiency than an RTX 4090, and exceeds…
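The four stages the abstract lists for the fused engine (interpolation, Softmax, aggregation, final projection) can be sketched in software as a single loop over sampling points that keeps only one running accumulator. This is a simplified illustration under stated assumptions, not the QUILL datapath: it handles one query, one head, and one feature level, it assumes the feature map has already been value-projected, and the names `bilinear`, `softmax`, `fused_deform_attn`, and `w_out` are hypothetical.

```python
import math

def bilinear(feat, x, y):
    """Bilinearly sample feat[row][col] (a channel vector) at pixel (x, y).

    Neighbours that fall outside the map contribute zero.
    """
    h, w, c = len(feat), len(feat[0]), len(feat[0][0])
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0
    out = [0.0] * c
    for yy, wy in ((y0, 1.0 - fy), (y0 + 1, fy)):
        for xx, wx in ((x0, 1.0 - fx), (x0 + 1, fx)):
            if 0 <= yy < h and 0 <= xx < w:
                for ch in range(c):
                    out[ch] += wy * wx * feat[yy][xx][ch]
    return out

def softmax(logits):
    """Numerically stable softmax over a small list of attention logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fused_deform_attn(feat, ref_xy, offsets, logits, w_out):
    """Sample, weight, aggregate, and project for one query in one pass.

    Assumptions (not from the paper): single head, single level, pixel-space
    offsets, and w_out standing in for the final projection matrix.
    """
    weights = softmax(logits)
    acc = [0.0] * len(feat[0][0])  # the only intermediate that is kept
    for (dx, dy), a in zip(offsets, weights):
        sample = bilinear(feat, ref_xy[0] + dx, ref_xy[1] + dy)
        for ch, v in enumerate(sample):
            acc[ch] += a * v  # aggregate immediately, never store samples
    # Final projection is applied directly to the accumulator.
    return [sum(w * v for w, v in zip(row, acc)) for row in w_out]

# Tiny made-up example: a 2x2 map with one channel.
feat = [[[0.0], [1.0]],
        [[2.0], [3.0]]]
out = fused_deform_attn(feat, ref_xy=(0.0, 0.0),
                        offsets=[(0.0, 0.0), (0.5, 0.5)],
                        logits=[0.0, 0.0],
                        w_out=[[2.0]])
print(out)
```

Because each interpolated sample is folded into the accumulator as soon as it is produced, no per-point tensor has to be written out and read back, which is the software analogue of "without spilling intermediates". A multi-scale, multi-head version would extend the same loop over levels and heads.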

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Network Packet Processing and Optimization · Cloud Computing and Resource Management