Dynamic Universal Approximation Theory: The Basic Theory for Transformer-based Large Language Models
Wei Wang, Qing Li

TL;DR
This paper develops a theoretical framework based on Universal Approximation Theory to explain the effectiveness, learning capabilities, and optimization strategies of Transformer-based large language models.
Contribution
It introduces a foundational theory that explains why Transformers are effective for LLMs, what underlies abilities such as In-Context Learning, and why pruning strategies are practical.
Findings
Provides a theoretical explanation for Transformer effectiveness
Analyzes the mechanisms behind In-Context Learning in LLMs
Supports the practicality of pruning strategies for LLMs
Abstract
Language models have emerged as a critical area of focus in artificial intelligence, particularly with the introduction of groundbreaking innovations like ChatGPT. Large-scale Transformer networks have quickly become the leading approach for advancing natural language processing algorithms. Built on the Transformer architecture, these models enable interactions that closely mimic human communication and, equipped with extensive knowledge, can even assist in guiding human tasks. Despite their impressive capabilities and growing complexity, a key question remains: what are the theoretical foundations of large language models (LLMs)? What makes the Transformer so effective for powering intelligent language applications, such as translation and coding? What underlies LLMs' ability for In-Context Learning (ICL)? How does the LoRA scheme enhance the fine-tuning of LLMs? And what supports the practicality…
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Layer Normalization · Focus · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Adam
