FAR: Function-preserving Attention Replacement for IMC-friendly Inference
Yuxin Ren, Maxwell D Collins, Miao Hu, Huanrui Yang

TL;DR
FAR introduces a function-preserving attention replacement for transformers, enabling efficient IMC-compatible inference with minimal accuracy loss and reduced latency.
Contribution
The paper proposes an attention replacement framework that preserves model performance while targeting in-memory computing (IMC) hardware.
Findings
FAR maintains comparable accuracy to original models on ImageNet.
FAR reduces model parameters and latency significantly.
Structured pruning enables resource adaptation without accuracy loss.
Abstract
While transformers dominate modern vision and language models, their attention mechanism remains poorly suited for in-memory computing (IMC) devices due to intensive activation-to-activation multiplications and non-local memory access, leading to substantial latency and bandwidth overhead on ReRAM-based accelerators. To address this mismatch, we propose FAR, a Function-preserving Attention Replacement framework that substitutes all attention in pretrained DeiTs with sequential modules inherently compatible with IMC dataflows. Specifically, FAR replaces self-attention with a multi-head bidirectional LSTM architecture via block-wise distillation to retain functional equivalence while enabling linear-time computation and localized weight reuse. We further incorporate structured pruning on FAR models, enabling flexible adaptation to resource-constrained IMC arrays while maintaining…
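To make the abstract's contrast between attention and a recurrent replacement concrete, the sketch below counts multiply-accumulates (MACs) for one self-attention layer and for a multi-head bidirectional LSTM of the same width. It is a rough illustration, not the paper's cost model. The function names, the per-head input slice of d/h, and the per-direction hidden size of d/(2h) are assumptions made here so that the output width matches d; the paper's actual sizing may differ. MAC counts are also not latency measurements.

```python
def attention_macs(num_tokens: int, dim: int) -> dict:
    """Count MACs in one standard self-attention layer.

    weight_act: Q, K, V and output projections (weights times activations).
    act_act:    Q @ K^T and softmax(.) @ V (activations times activations).
    The head count cancels: h heads of width d/h cost h * N * N * (d/h).
    """
    weight_act = 4 * num_tokens * dim * dim
    act_act = 2 * num_tokens * num_tokens * dim
    return {"weight_act": weight_act, "act_act": act_act}


def bilstm_macs(num_tokens: int, dim: int, num_heads: int) -> dict:
    """Count MACs in a hypothetical multi-head bidirectional LSTM.

    Assumed sizing (not taken from the paper): each head reads a d/h slice
    of the input, and each direction has hidden size d/(2h), so that
    concatenating both directions over all heads gives width d again.
    """
    if dim % (2 * num_heads) != 0:
        raise ValueError("dim must be divisible by 2 * num_heads")
    d_in = dim // num_heads
    d_hid = dim // (2 * num_heads)
    # Four gates, each with an input matrix and a recurrent matrix.
    per_step = 4 * (d_in + d_hid) * d_hid
    weight_act = num_tokens * num_heads * 2 * per_step
    # Gate products (f*c, i*g, o*tanh(c)) are activation-to-activation,
    # but they are elementwise, so they grow with N * d rather than N^2 * d.
    elementwise = num_tokens * num_heads * 2 * 3 * d_hid
    return {"weight_act": weight_act, "elementwise": elementwise}


if __name__ == "__main__":
    # Illustrative sizes only: 197 tokens, width 384, 6 heads.
    for n in (197, 394, 788):
        attn = attention_macs(n, 384)
        lstm = bilstm_macs(n, 384, 6)
        print(n, attn["act_act"], lstm["weight_act"], lstm["elementwise"])
```

Under these assumptions, the activation-to-activation term of attention grows quadratically with the token count, while every matrix product in the LSTM uses fixed weights and grows linearly. That is the property the abstract refers to as linear-time computation with localized weight reuse.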
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications · Advanced Memory and Neural Computing
Methods: Linear Layer · Softmax · Multi-Head Attention · Dropout · Attention Is All You Need · Residual Connection · Layer Normalization · Dense Connections · Vision Transformer · Pruning
