Understanding the Effect of Noise in LLM Training Data with Algorithmic Chains of Thought
Alex Havrilla, Maia Iyer

TL;DR
This paper investigates how different types and levels of noise in chain of thought data affect large language models' performance, revealing that models are more sensitive to dynamic noise than static noise, especially during fine-tuning.
Contribution
The study introduces the TInt framework for generating customizable noised traces and provides a detailed analysis of noise impact on LLMs in algorithmically solvable tasks.
Findings
Fine-tuned models are robust to static noise but sensitive to dynamic noise.
Prompted models are more affected by static noise than fine-tuned models.
Removing samples with destructive dynamic noise improves model performance.
Abstract
During both pretraining and fine-tuning, Large Language Models (LLMs) are trained on trillions of tokens of text of widely varying quality. Both phases of training typically involve heuristically filtering out "low-quality" or noisy training samples, yet little is known quantitatively about how the type or intensity of noise affects downstream performance. In this work, we study how noise in chain of thought (CoT) impacts task performance in the highly-controlled setting of algorithmically solvable tasks. First, we develop the Traced Integer (TInt) framework to generate highly customizable noised execution traces for any arithmetic function on lists of integers. We then define two types of noise: static noise, a local form of noise which is applied after the CoT trace is computed, and dynamic noise, a global form of noise which…
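The distinction between the two noise types can be illustrated with a toy execution trace. The sketch below is not the TInt framework or its API; the function names (trace_cumsum, static_noise) and the running-sum task are hypothetical choices made for illustration. Static noise follows the definition above: a single trace entry is altered after the trace has been computed. The abstract is cut off before it finishes defining dynamic noise, so the sketch assumes that dynamic noise perturbs the intermediate state during computation, which lets the error carry into every later step.

```python
def trace_cumsum(xs, corrupt_at=None, delta=1):
    """Return the running-sum trace of xs.

    If corrupt_at is given, the accumulator itself is perturbed at that
    step (assumed model of dynamic noise), so later steps inherit the error.
    """
    acc, trace = 0, []
    for i, x in enumerate(xs):
        acc += x
        if i == corrupt_at:
            acc += delta  # error enters the computation state
        trace.append(acc)
    return trace


def static_noise(trace, at, delta=1):
    """Perturb one entry of an already-computed trace (local, no propagation)."""
    noisy = list(trace)
    noisy[at] += delta
    return noisy


xs = [3, 1, 4, 1, 5]
clean = trace_cumsum(xs)                   # [3, 4, 8, 9, 14]
static = static_noise(clean, at=1)         # [3, 5, 8, 9, 14]
dynamic = trace_cumsum(xs, corrupt_at=1)   # [3, 5, 9, 10, 15]

print(clean, static, dynamic)
```

In this toy case the statically noised trace still ends in the correct answer, while the dynamically noised one does not. That difference is one plausible reading of why the two noise types could affect training differently.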
Taxonomy
Topics: Data Mining Algorithms and Applications · Big Data and Business Intelligence · AI-based Problem Solving and Planning
