Nonparametric Teaching of Attention Learners
Chen Zhang, Jianghui Wang, Bingyang Cheng, Zhongtao Chen, Wendong XU, Cong Wang, Marco Canini, Francesco Orabona, Yik Chung WU, Ngai Wong

TL;DR
This paper introduces Attention Neural Teaching (AtteNT), a nonparametric framework that accelerates training of attention-based neural networks like transformers and ViTs by selective example teaching, reducing training time without sacrificing accuracy.
Contribution
The paper presents a novel nonparametric teaching paradigm for attention learners, providing theoretical insights and demonstrating significant training efficiency improvements.
Findings
Training time reduced by up to 20.58% for ViTs and 13.01% for LLMs.
Training efficiency is improved without loss of accuracy.
Theoretical framework links attention mechanisms with nonparametric teaching principles.
Abstract
Attention learners, neural networks built on the attention mechanism, e.g., transformers, excel at learning the implicit relationships that relate sequences to their corresponding properties, e.g., mapping a given sequence of tokens to the probability of the next token. However, the learning process tends to be costly. To address this, we present a novel paradigm named Attention Neural Teaching (AtteNT) that reinterprets the learning process through a nonparametric teaching perspective. Specifically, the latter provides a theoretical framework for teaching mappings that are implicitly defined (i.e., nonparametric) via example selection. Such an implicit mapping is embodied through a dense set of sequence-property pairs, with the AtteNT teacher selecting a subset to accelerate convergence in attention learner training. By analytically investigating the role of attention on…
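The selection step described in the abstract can be illustrated with a minimal sketch. It assumes a simple greedy rule, where the teacher picks the examples on which the learner currently has the highest error, as one of the reviews below characterizes the method. It does not reproduce the paper's actual selection criterion or algorithm. The function name `select_hardest`, the squared-error measure, and the toy data are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def select_hardest(
    examples: Sequence[Tuple[float, float]],
    predict: Callable[[float], float],
    k: int,
) -> List[int]:
    """Return the indices of the k examples with the largest squared error.

    Assumption: squared error stands in for whatever discrepancy measure
    the actual teacher uses; this is not the paper's criterion.
    """
    errors = [(predict(x) - y) ** 2 for x, y in examples]
    # Rank example indices from largest to smallest error.
    ranked = sorted(range(len(examples)), key=lambda i: errors[i], reverse=True)
    return ranked[:k]


# Hypothetical toy pool: the target mapping is y = 2x, while the
# learner currently predicts y = x.
pool = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
chosen = select_hardest(pool, predict=lambda x: x, k=2)
print(chosen)
```

In a training loop, a teacher of this kind would recompute the errors periodically and feed only the chosen subset to the learner, so the cost of the selection itself has to be counted when comparing training times.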
Peer Reviews
Decision: ICLR 2026 Poster
1. The primary strength of this paper is the novel and elegant theoretical bridge it builds between attention learning and nonparametric teaching. 2. The theoretical insights are translated into a simple, intuitive, and practical algorithm. The idea of focusing on samples with the highest error (i.e., the hardest examples) is a well-known concept, but this paper provides a nice theoretical justification for it. 3. While the theoretical analysis is limited to a single layer, the authors show t
1. From a practical standpoint, in addition to attention layers, the models used also include complexities such as residual connections, layer normalization, and multiple non-linear blocks. The paper does not address how the theoretical findings generalize from the simple model to these complex ones. 2. The setup assumes a noiseless case. Eq. 21 and its use in Algorithm 1 seem to be sensitive to this assumption. It is not clear or discussed how the analysis for nonparametric teaching performs w
1. This paper contributes to accelerating attention learning, which is an important problem in training LLMs and CV models. 2. The motivation of the AtteNT method has theoretical justification, and the experimental results further validate the effectiveness of the method. 3. The experiments are conducted over multiple LLMs and vision models, and an ablation study is included.
1. Some experimental settings are unclear to me. Since the paper claims a reduction in training time, which metric is used to compute the training time with and without AtteNT? Specifically, how is it decided when to terminate training for each method? Is the sequence selection time included for the model with AtteNT? 2. In Table 1 and Table 2, adding AtteNT also improves the learning performance. Why AtteNT leads to such a performance improvement is not sufficiently explained in the paper. T
- Interesting combination of non-parametric theory with parametric learning. - The results of the final method seem convincing for the given examples. - Extension of existing method.
- I had a quick look at the code and it is not well documented; one would have to invest some time to understand how to reproduce their results. - I find the paper generally difficult to parse, but perhaps this is just my lack of background. In particular, Section 4.1 could be improved by giving some intuition and describing under which circumstances the kernel would not be adaptive in terms of $\omega_j$, but would require higher-order importance weights. - For Theorem 3, how important are your
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques
