Dynamic Universal Approximation Theory: The Basic Theory for Transformer-based Large Language Models

Wei Wang; Qing Li

arXiv:2407.00958·cs.AI·December 12, 2024

Open Access

TL;DR

This paper develops a theoretical framework based on Universal Approximation Theory to explain the effectiveness, learning capabilities, and optimization strategies of Transformer-based large language models.

Contribution

It introduces a foundational theory that explains why Transformers are effective for LLMs, what underlies abilities such as In-Context Learning, and why pruning strategies are practical.

Findings

01 Provides a theoretical explanation for Transformer effectiveness

02 Analyzes the mechanisms behind In-Context Learning in LLMs

03 Supports the practicality of pruning strategies for LLMs

Abstract

Language models have emerged as a critical area of focus in artificial intelligence, particularly with the introduction of groundbreaking innovations like ChatGPT. Large-scale Transformer networks have quickly become the leading approach for advancing natural language processing algorithms. Built on the Transformer architecture, these models enable interactions that closely mimic human communication and, equipped with extensive knowledge, can even assist in guiding human tasks. Despite their impressive capabilities and growing complexity, a key question remains: what are the theoretical foundations of large language models (LLMs)? What makes the Transformer so effective for powering intelligent language applications, such as translation and coding? What underlies LLMs' ability for In-Context Learning (ICL)? How does the LoRA scheme enhance the fine-tuning of LLMs? And what supports the practicality…

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Layer Normalization · Focus · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Adam