Examining Two Hop Reasoning Through Information Content Scaling
David Johnston, Nora Belrose

TL;DR
This paper investigates how transformers learn to answer two-hop questions. It finds that their ability to generalize depends on dataset parameters, and that very small models can end up memorizing answers rather than learning to compose facts.
Contribution
It uses the way transformers' capacity to learn two-hop QA datasets scales with model size to analyze how they answer two-hop questions, and shows that dataset parameters can steer very small models into memorizing answers.
Findings
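Transformers learn latent two-hop questions inconsistently, and in some regimes memorize answers instead of composing facts.
Capacity scaling and generalization both support the hypothesis that latent two-hop QA requires each fact to be learned twice, whereas two-hop QA with chain of thought does not.
With appropriate dataset parameters, very small models can be trapped in a regime where they memorize two-hop answers independently, even though function composition would serve them better.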
Abstract
Prior work has found that transformers have an inconsistent ability to learn to answer latent two-hop questions -- questions of the form "Who is Bob's mother's boss?" We study why this is the case by examining how transformers' capacity to learn datasets of two-hop questions and answers (two-hop QA) scales with their size, motivated by prior work on transformer knowledge capacity for simple factual memorization. We find that capacity scaling and generalization both support the hypothesis that latent two-hop QA requires transformers to learn each fact twice, while two-hop QA with chain of thought does not. We also show that with appropriate dataset parameters, it is possible to "trap" very small models in a regime where they memorize answers to two-hop questions independently, even though they would perform better if they could learn to answer them with function composition. Our findings…
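To make the contrast between memorizing answers independently and answering by function composition concrete, here is a minimal Python sketch. It is not the paper's dataset or method: the entity and relation counts, the random fact table, and the helper names (`make_facts`, `two_hop_answer`, `memorized_table`) are all assumptions chosen for illustration. Counting table entries is only a rough stand-in for information content.

```python
import random

def make_facts(num_entities, num_relations, seed=0):
    """Hypothetical one-hop facts: (entity, relation) -> entity."""
    rng = random.Random(seed)
    return {
        (e, r): rng.randrange(num_entities)
        for e in range(num_entities)
        for r in range(num_relations)
    }

def two_hop_answer(facts, entity, r1, r2):
    """Answer a two-hop question by composing two one-hop lookups."""
    bridge = facts[(entity, r1)]   # e.g. Bob's mother
    return facts[(bridge, r2)]     # e.g. that person's boss

def memorized_table(facts, num_entities, num_relations):
    """Store every two-hop answer as its own independent entry."""
    return {
        (e, r1, r2): two_hop_answer(facts, e, r1, r2)
        for e in range(num_entities)
        for r1 in range(num_relations)
        for r2 in range(num_relations)
    }

# Illustrative sizes only (assumed, not taken from the paper).
facts = make_facts(num_entities=50, num_relations=4)
table = memorized_table(facts, num_entities=50, num_relations=4)
print(len(facts), len(table))  # 200 800
```

In this toy setting, composition needs only the 200 one-hop facts, while independent memorization stores 800 separate answers. That gap is the sense in which a model could do better by composing facts than by memorizing each two-hop answer.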
Taxonomy
Topics: Advanced Text Analysis Techniques
