LUCID: Attention with Preconditioned Representations

Sai Surya Duvvuri; Nirmal Patel; Nilesh Gupta; Inderjit S. Dhillon

arXiv:2602.10410·cs.LG·February 12, 2026

TL;DR

LUCID introduces a preconditioned attention mechanism that improves focus on relevant tokens in long sequences, overcoming softmax limitations and enhancing performance in long-context retrieval tasks.

Contribution

The paper proposes LUCID Attention, a novel architectural modification applying a preconditioner to improve focus in long-sequence attention without increasing computational complexity.

Findings

01. Up to 18% improvement on BABILong retrieval tasks

02. Achieves better focus on relevant tokens in long sequences

03. Validates effectiveness on models with ~1 billion parameters

Abstract

Softmax-based dot-product attention is a cornerstone of Transformer architectures, enabling remarkable capabilities such as in-context learning. However, as context lengths increase, a fundamental limitation of the softmax function emerges: it tends to diffuse probability mass to irrelevant tokens, degrading performance in long-sequence scenarios. Furthermore, attempts to sharpen focus by lowering the softmax temperature hinder learnability due to vanishing gradients. We introduce LUCID Attention, an architectural modification that applies a preconditioner to the attention probabilities. This preconditioner, derived from exponentiated key-key similarities, minimizes overlap between the keys in a Reproducing Kernel Hilbert Space, allowing the query to focus accurately on important keys among a large number of keys, with the same computational complexity as standard attention. Additionally,…
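The two softmax limitations named in the abstract can be illustrated with a toy calculation. The sketch below is not from the paper: it assumes one relevant key whose logit exceeds all distractor logits by a fixed `gap` (a hypothetical parameter), and the helper names are illustrative. It shows the mass on the relevant key shrinking as the number of keys grows, and the gradient with respect to that key's logit collapsing when the temperature is lowered.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def relevant_mass(n_keys, gap=5.0, temperature=1.0):
    """Probability on one relevant key (logit = gap) among n_keys - 1 distractors (logit = 0)."""
    logits = [gap] + [0.0] * (n_keys - 1)
    return softmax(logits, temperature)[0]

def grad_wrt_relevant_logit(n_keys, gap=5.0, temperature=1.0):
    """d p_0 / d z_0 = p_0 * (1 - p_0) / temperature for the softmax above."""
    p0 = relevant_mass(n_keys, gap, temperature)
    return p0 * (1.0 - p0) / temperature

# Mass on the relevant key decays as the context grows (fixed logit gap).
for n in (16, 1024, 65536):
    print(n, relevant_mass(n))

# Lowering the temperature restores focus but saturates the softmax,
# so the gradient through the relevant logit becomes negligible.
print(grad_wrt_relevant_logit(1024, temperature=1.0))
print(grad_wrt_relevant_logit(1024, temperature=0.05))
```

With a fixed gap, the relevant key's share is `exp(gap) / (exp(gap) + n_keys - 1)`, so it falls roughly as `1 / n_keys` once the distractors dominate the denominator.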

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 5

Strengths

1. The paper presents a well-motivated and theoretically grounded contribution. It provides a clear diagnosis of the limitations of softmax attention: its diffuse probability distributions and key correlations. This paper also introduces LUCID through a clean, RKHS-based derivation.
2. The proposed preconditioning step is mathematically elegant and interpretable. It functions as a decorrelation operator that enhances retrieval precision and training stability without adding extra parameters.
3

Weaknesses

1. Evaluation focuses heavily on NIAH benchmarks and lacks large-scale or real-world long-document tasks (e.g., book summarization, multi-hop QA). Broader validation would strengthen claims of generalization.
2. The paper evaluates LUCID against relatively few baselines, leaving open questions about its comparative advantages over a broader range of recent attention mechanisms.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The idea of using a key-key similarity preconditioner to decorrelate keys in a Reproducing Kernel Hilbert Space (RKHS) is novel and theoretically motivated.
- The method is presented as a drop-in replacement for standard attention, requiring no additional parameters and maintaining \(\mathcal{O}(N^2 d)\) complexity.
- Empirical results on SNIAH and MNIAH tasks demonstrate improved retrieval performance over standard attention and some existing variants.

Weaknesses

1. **Limited Empirical Validation of Core Claims**
   The paper claims that LUCID improves focus and reduces attentional noise, but no direct evidence is provided (e.g., visualization of attention maps or quantitative analysis of attention sparsity). Similarly, the claim that LUCID mitigates gradient vanishing is not empirically verified. It would be valuable to:
   - Visualize attention distributions for LUCID vs. standard attention on long-context examples.
   - Include gradient norm analysi
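A quantitative sparsity analysis of the kind this reviewer asks for could start from per-row statistics such as entropy and top-k mass. The sketch below is only an illustration of such metrics. The function names and the two example rows are hypothetical, and nothing here is taken from the paper or its code.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention row; lower means sharper focus."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

def top_k_mass(weights, k=1):
    """Total probability carried by the k largest attention weights."""
    return sum(sorted(weights, reverse=True)[:k])

# Two made-up attention rows over four keys.
diffuse = [0.25, 0.25, 0.25, 0.25]
focused = [0.97, 0.01, 0.01, 0.01]

print(attention_entropy(diffuse), top_k_mass(diffuse))
print(attention_entropy(focused), top_k_mass(focused))
```

Averaging these statistics over heads, layers, and query positions would give a single number per model that can be compared across attention variants at each context length.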

Reviewer 03 · Rating 4 · Confidence 4

Strengths

The proposed method is novel and consistently outperforms baselines across synthetic benchmarks.

Weaknesses

1. The current benchmarks are mostly synthetic. Could the authors compare their method with baselines on real-world NLP long-context modeling benchmarks, e.g. LongBench [1]?
2. How is the matrix inversion implemented efficiently? Could the authors compare the wall-clock time against softmax attention?
3. What is the motivation for assuming Q=K in the theoretical part? Why do we need to find a P that, when multiplied with the attention scores, produces an identity matrix?

[1]. LongBench: A Bilingual, Multitask Benchmark
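One way to read the third question is through a two-key toy case. The sketch below is an assumption-laden illustration, not the paper's derivation: it takes Q=K, uses unnormalized scores `exp(k_i . k_j)`, and picks P as the inverse of that exponentiated key-key matrix. The key values and helper names are made up, and because the matrix is symmetric here the sketch does not settle on which side the paper applies P.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def exp_gram(keys):
    """Exponentiated key-key similarities: G[i][j] = exp(k_i . k_j)."""
    return [[math.exp(dot(ki, kj)) for kj in keys] for ki in keys]

def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix (enough for this toy case)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def row_normalize(m):
    return [[v / sum(row) for v in row] for row in m]

keys = [[1.0, 0.0], [0.8, 0.6]]   # two correlated unit-norm keys (toy values)
scores = exp_gram(keys)           # with Q = K, the scores are the key-key matrix
plain = row_normalize(scores)     # softmax-style weights: mass leaks to the other key
precond = matmul(scores, inverse_2x2(scores))  # assumed P = G^{-1}, so scores @ P = I

print(plain[0])    # query 0 splits its weight between both keys
print(precond[0])  # query 0 selects only its own key (up to rounding)
```

Under these assumptions the identity target means that a query equal to key i retrieves exactly value i, with no contribution from correlated keys. For N keys the 2x2 inverse would be replaced by a general solve, which is where the reviewer's efficiency question applies.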

Code & Models

Models

🤗
KitsuVp/NeoLLM
model· 2.9k dl· ♡ 1

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Ferroelectric and Negative Capacitance Devices