Reasoning Capabilities of Large Language Models on Dynamic Tasks
Annie Wong, Thomas Bäck, Aske Plaat, Niki van Stein, Anna V. Kononova

TL;DR
This paper evaluates large language models' reasoning abilities in dynamic tasks, revealing performance gaps, the impact of prompting strategies, and persistent limitations compared to human reasoning.
Contribution
It systematically assesses three prompting strategies (self-reflection, heuristic mutation, and planning) on dynamic tasks with open-source models, highlighting their effects and the remaining challenges in achieving human-like reasoning.
Findings
Larger models generally outperform smaller ones.
Strategic prompting can narrow the performance gap between smaller and larger models.
Advanced prompting benefits smaller models more on complex tasks.
Abstract
Large language models excel on static benchmarks, but their ability as self-learning agents in dynamic environments remains unclear. We evaluate three prompting strategies (self-reflection, heuristic mutation, and planning) across dynamic tasks with open-source models. First, we find that larger models generally outperform smaller ones, but that strategic prompting can close this performance gap. Second, an overly long prompt can negatively impact smaller models on basic reactive tasks, while larger models show more robust behaviour. Third, advanced prompting techniques primarily benefit smaller models on complex games, but offer less improvement for already high-performing large language models. Yet, we find that advanced reasoning methods yield highly variable outcomes: while capable of significantly improving performance when reasoning and decision-making align, they also introduce…
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Language and cultural evolution
Methods: Self-Learning
