On Understanding Attention-Based In-Context Learning for Categorical Data
Aaron T. Wang, William Convertino, Xiang Cheng, Ricardo Henao, and Lawrence Carin

TL;DR
This paper analyzes attention-based in-context learning for categorical data and presents an attention network that exactly performs multi-step functional gradient descent inference, supported by theoretical analysis and by experiments on synthetic data, image classification, and language generation.
Contribution
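The paper introduces an attention-based architecture, built from blocks that pair a self-attention layer with a cross-attention layer and skip connections, that exactly performs multi-step functional gradient descent inference for categorical data, together with theoretical analysis and empirical validation.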
Findings
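The model exactly performs multi-step functional gradient descent inference for in-context learning with categorical observations.
The theoretical analysis generalizes many assumptions made in prior work, including the class of attention mechanisms to which the analysis applies.
Empirical results demonstrate the framework on synthetic data, image classification, and language generation.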
Abstract
In-context learning based on attention models is examined for data with categorical outcomes, with inference in such models viewed from the perspective of functional gradient descent (GD). We develop a network composed of attention blocks, with each block employing a self-attention layer followed by a cross-attention layer, with associated skip connections. This model can exactly perform multi-step functional GD inference for in-context inference with categorical observations. We perform a theoretical analysis of this setup, generalizing many prior assumptions in this line of work, including the class of attention mechanisms for which it is appropriate. We demonstrate the framework empirically on synthetic data, image classification and language generation.
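The functional GD view described in the abstract can be illustrated with a minimal sketch. This is not the authors' architecture: it assumes an RBF kernel, lets the learned function output class logits directly, and uses hypothetical names (`rbf_kernel`, `functional_gd_predict`). Each step adds a kernel-weighted sum of residuals (one-hot label minus predicted probability) over the context examples. The additive update loosely resembles a skip connection, and the kernel weights stand in for attention weights; that correspondence is an illustrative assumption only.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rbf_kernel(a, b, bandwidth=1.0):
    # Pairwise RBF kernel between rows of a (m, d) and rows of b (n, d).
    # The kernel choice is an assumption made for this sketch.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def functional_gd_predict(x_ctx, y_ctx, x_query, num_classes, steps=50, lr=1.0):
    # Multi-step functional GD on the cross-entropy loss (hypothetical helper).
    n = x_ctx.shape[0]
    onehot = np.eye(num_classes)[y_ctx]
    k_cc = rbf_kernel(x_ctx, x_ctx)      # context-to-context weights, (n, n)
    k_qc = rbf_kernel(x_query, x_ctx)    # query-to-context weights, (m, n)
    f_ctx = np.zeros((n, num_classes))   # function values at context inputs
    f_qry = np.zeros((x_query.shape[0], num_classes))
    for _ in range(steps):
        # Residuals are the negative functional-gradient coefficients.
        resid = onehot - softmax(f_ctx)
        # Additive updates: previous value plus kernel-weighted residuals.
        f_qry = f_qry + (lr / n) * (k_qc @ resid)
        f_ctx = f_ctx + (lr / n) * (k_cc @ resid)
    return softmax(f_qry)

# Toy in-context task: two well-separated clusters with labels 0 and 1.
x_ctx = np.array([[-2.0, 0.0], [-2.5, 0.5], [-1.5, -0.5],
                  [2.0, 0.0], [2.5, 0.5], [1.5, -0.5]])
y_ctx = np.array([0, 0, 0, 1, 1, 1])
x_query = np.array([[-2.0, 0.2], [2.0, -0.2]])
probs = functional_gd_predict(x_ctx, y_ctx, x_query, num_classes=2)
print(probs.argmax(axis=1))
```

No parameters are trained here: the context pairs alone determine the prediction for each query, which is the sense in which the inference is "in-context".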
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Human Pose and Action Recognition · Machine Learning and Data Classification
Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout
