Large Language Models Still Face Challenges in Multi-Hop Reasoning with External Knowledge
Haotong Zhang

TL;DR
This paper evaluates GPT-3.5's multi-hop reasoning capabilities, revealing significant limitations in knowledge integration, non-sequential reasoning, and scalability, despite high performance on various benchmarks.
Contribution
It systematically assesses large language models' multi-hop reasoning abilities across multiple aspects, highlighting existing challenges and gaps compared to human reasoning.
Findings
GPT-3.5 struggles with multi-hop reasoning tasks
Models have difficulty generalizing to more complex, longer reasoning chains
Significant gap remains between model performance and human reasoning abilities
Abstract
We carry out a series of experiments to test large language models' multi-hop reasoning ability from three aspects: selecting and combining external knowledge, dealing with non-sequential reasoning tasks, and generalising to data samples with larger numbers of hops. We test the GPT-3.5 model on four reasoning benchmarks with Chain-of-Thought prompting (and its variations). Our results reveal that, despite the amazing performance achieved by large language models on various reasoning tasks, models still suffer from severe drawbacks that leave a large gap with humans.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Attention Is All You Need · Attention Dropout · Softmax · Cosine Annealing · Byte Pair Encoding · Linear Layer · Linear Warmup With Cosine Annealing · Multi-Head Attention
