Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning
Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Taiji Suzuki, Qingfu Zhang, Hau-San Wong

TL;DR
This paper provides a detailed mathematical analysis showing how transformer models utilize multi-concept word semantics to achieve powerful in-context learning and out-of-distribution generalization, supported by empirical simulations.
Contribution
It introduces a fine-grained theoretical framework connecting multi-concept semantics to ICL capabilities, with an analysis showing exponential convergence of the transformer training dynamics.
Findings
Transformers leverage multi-concept semantics for effective ICL.
Exponential convergence is achieved in the training dynamics despite non-convexity.
Empirical results support the theoretical analysis.
Abstract
Transformer-based large language models (LLMs) have displayed remarkable creative prowess and emergence capabilities. Existing empirical studies have revealed a strong connection between these LLMs' impressive emergence abilities and their in-context learning (ICL) capacity, which allows them to solve new tasks using only task-specific prompts without further fine-tuning. On the other hand, existing empirical and theoretical studies also show that there is a linear regularity of the multi-concept encoded semantic representation behind transformer-based LLMs. However, existing theoretical work fails to build up an understanding of the connection between this regularity and the innovative power of ICL. Additionally, prior work often focuses on simplified, unrealistic scenarios involving linear transformers or unrealistic loss functions, and achieves only linear or sub-linear convergence…
Taxonomy
Topics: Topic Modeling · Text and Document Classification Technologies · Natural Language Processing Techniques
Methods: Softmax
