Dissecting Multiplication in Transformers: Insights into LLMs
Luyu Qiu, Jianing Li, Chi Su, Chen Jason Zhang, Lei Chen

TL;DR
This paper analyzes how transformers perform integer multiplication, identifies their limitations in handling carryovers, and proposes improvements that significantly boost accuracy, enhancing interpretability and trust in large language models.
Contribution
The paper provides a detailed analysis of transformers' shortcomings in multiplication and introduces targeted enhancements that improve performance and interpretability.
Findings
Transformers decompose multiplication into parallel subtasks.
Difficulty in calculating carryovers limits performance.
Proposed improvements achieve over 99.9% accuracy on 5-digit multiplication.
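The findings above can be made concrete with ordinary schoolbook multiplication. The sketch below is not the paper's model or its proposed improvement; it is a minimal illustration, under the assumption of non-negative integer inputs, of why the two steps differ in difficulty. The function names (column_sums, resolve_carries, multiply) are hypothetical and chosen for this example. Each column sum depends only on a few input digit pairs and can be computed independently of the others, whereas each carry depends on every lower-order column and must be resolved in sequence.

```python
def column_sums(a: int, b: int) -> list[int]:
    """Carry-free column sums of a * b, least significant column first.

    Each entry depends only on pairs of input digits, so the columns
    can be computed independently of one another (in parallel).
    Assumes a and b are non-negative integers.
    """
    xs = [int(d) for d in reversed(str(a))]
    ys = [int(d) for d in reversed(str(b))]
    sums = [0] * (len(xs) + len(ys) - 1)
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            sums[i + j] += x * y
    return sums


def resolve_carries(sums: list[int]) -> list[int]:
    """Turn column sums into base-10 digits, least significant first.

    The carry into a column depends on all lower columns, so this
    step is inherently sequential.
    """
    digits = []
    carry = 0
    for s in sums:
        total = s + carry
        digits.append(total % 10)
        carry = total // 10
    while carry:
        digits.append(carry % 10)
        carry //= 10
    return digits


def multiply(a: int, b: int) -> int:
    """Multiply via column sums followed by carry resolution."""
    digits = resolve_carries(column_sums(a, b))
    return int("".join(str(d) for d in reversed(digits)))


if __name__ == "__main__":
    print(column_sums(12, 34))      # carry-free columns
    print(multiply(12345, 67890))   # 5-digit by 5-digit product
```

For 12 × 34, the column sums are 8, 10, and 3; only the carry step turns the middle column's 10 into a 0 and pushes a 1 into the next column, giving 408.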
Abstract
Transformer-based large language models have achieved remarkable performance across various natural language processing tasks. However, they often struggle with seemingly easy tasks like arithmetic despite their vast capabilities. This stark disparity raises concerns about their safe and ethical use and hinders their widespread adoption. In this paper, we focus on a typical arithmetic task, integer multiplication, to explore and explain the imperfection of transformers in this domain. We provide a comprehensive analysis of a vanilla transformer trained to perform n-digit integer multiplication. Our observations indicate that the model decomposes the multiplication task into multiple parallel subtasks, sequentially optimizing each subtask for each digit to complete the final multiplication. Based on observation and analysis, we infer the reasons for transformers' deficiencies in multiplication…
Taxonomy
Topics: Advanced Data Storage Technologies · Advancements in Semiconductor Devices and Circuit Design
Methods: Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Focus · Label Smoothing · Linear Layer · GPT-4 · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings
