Should You Use Your Large Language Model to Explore or Exploit?
Keegan Harris, Aleksandrs Slivkins

TL;DR
This paper systematically evaluates how well large language models perform exploration and exploitation. It finds that reasoning models are strong at exploitation and that LLMs can be useful for exploration, but it also highlights their limitations compared to simpler models.
Contribution
It provides a comprehensive analysis of LLMs' performance in exploration and exploitation, introduces a siloed evaluation approach, and assesses the effects of tool use and in-context summarization.
Findings
Reasoning models show the most promise on exploitation tasks but are too slow or costly for many practical settings.
Non-reasoning models benefit from tool use and summarization, improving medium-difficulty task performance.
LLMs outperform simple linear regression in exploration of large, semantically rich action spaces.
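The findings mention in-context summarization as a mitigation for non-reasoning models. One common way to summarize bandit history is to replace the raw list of interactions with per-arm sufficient statistics before it goes into the prompt. The sketch below assumes a non-contextual multi-armed bandit with `(arm, reward)` pairs; the function names and text format are hypothetical illustrations, not the paper's actual summarization scheme.

```python
from collections import defaultdict


def summarize_history(history):
    """Collapse raw (arm, reward) pairs into per-arm (pull count, mean reward).

    Hypothetical helper: assumes a plain multi-armed bandit without contexts.
    """
    counts = defaultdict(int)
    totals = defaultdict(float)
    for arm, reward in history:
        counts[arm] += 1
        totals[arm] += reward
    # Sort arms so the summary is deterministic regardless of history order.
    return {arm: (counts[arm], totals[arm] / counts[arm]) for arm in sorted(counts)}


def format_summary(summary):
    """Render the per-arm statistics as compact text suitable for a prompt."""
    return "\n".join(
        f"arm {arm}: pulled {n} times, average reward {mean:.2f}"
        for arm, (n, mean) in summary.items()
    )


history = [("A", 1.0), ("B", 0.0), ("A", 0.0), ("A", 1.0), ("B", 1.0)]
print(format_summary(summarize_history(history)))
```

The design choice is that the summary's length grows with the number of arms rather than the number of rounds, which keeps the prompt short over long horizons.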
Abstract
We evaluate the ability of the current generation of large language models (LLMs) to help a decision-making agent facing an exploration-exploitation tradeoff. While previous work has largely studied the ability of LLMs to solve combined exploration-exploitation tasks, we take a more systematic approach and use LLMs to explore and exploit in silos in various (contextual) bandit tasks. We find that reasoning models show the most promise for solving exploitation tasks, although they are still too expensive or too slow to be used in many practical settings. Motivated by this, we study tool use and in-context summarization using non-reasoning models. We find that these mitigations may be used to substantially improve performance on medium-difficulty tasks; however, even then, all LLMs we study perform worse than a simple linear regression, even in non-linear settings. On the other hand, we find…
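The abstract reports that, in the exploitation silo, the LLMs studied perform worse than a simple linear regression. To make that baseline concrete, here is a minimal sketch of a greedy linear-regression exploiter for a contextual bandit. It assumes one linear reward model per arm, fitted by least squares with a tiny ridge term for numerical stability, and a synthetic data generator; these choices are illustrative assumptions, not the paper's exact baseline.

```python
import numpy as np


def fit_per_arm_linear(contexts, actions, rewards, n_arms, reg=1e-3):
    """Fit one linear reward model per arm by (lightly regularized) least squares.

    Assumption: rewards are modeled as theta_a . x for each arm a.
    An arm with no data gets theta = 0, since X.T @ X is then all zeros.
    """
    d = contexts.shape[1]
    thetas = np.zeros((n_arms, d))
    for a in range(n_arms):
        mask = actions == a
        X, y = contexts[mask], rewards[mask]
        # Normal equations with a small ridge term so the system is always solvable.
        thetas[a] = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)
    return thetas


def exploit(thetas, context):
    """Greedy exploitation: pick the arm with the highest predicted reward."""
    return int(np.argmax(thetas @ context))


# Synthetic logged data (hypothetical): arm 0 rewards feature 0, arm 1 rewards feature 1.
rng = np.random.default_rng(0)
true_thetas = np.array([[1.0, 0.0], [0.0, 1.0]])
contexts = rng.normal(size=(200, 2))
actions = rng.integers(0, 2, size=200)
rewards = np.sum(true_thetas[actions] * contexts, axis=1) + 0.01 * rng.normal(size=200)

thetas = fit_per_arm_linear(contexts, actions, rewards, n_arms=2)
print(exploit(thetas, np.array([1.0, 0.0])), exploit(thetas, np.array([0.0, 1.0])))
```

Because this baseline only reads the logged history and never chooses what to explore, it matches the "exploit in a silo" setting: the question is how well an agent can pick the best action from data it did not collect.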
Taxonomy
Topics: Data Quality and Management · Topic Modeling · Network Security and Intrusion Detection
