STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models
Narun Raman, Taylor Lundy, Thiago Amin, Jesse Perla, Kevin Leyton-Brown

TL;DR
This paper introduces STEER-ME, a comprehensive benchmark for evaluating large language models' microeconomic reasoning across diverse tasks, domains, and perspectives, using an innovative auto-STEER data generation protocol.
Contribution
It presents a novel taxonomy of microeconomic reasoning elements and an automated data generation method to create diverse, unbiased evaluation questions for LLMs.
Findings
Large models outperform smaller ones in microeconomic reasoning.
Prompting strategies significantly affect model performance.
Auto-STEER enables scalable and unbiased benchmark creation.
Abstract
How should one judge whether a given large language model (LLM) can reliably perform economic reasoning? Most existing LLM benchmarks focus on specific applications and fail to present the model with a rich variety of economic tasks. A notable exception is Raman et al. [2024], who offer an approach for comprehensively benchmarking strategic decision-making; however, this approach fails to address the non-strategic settings prevalent in microeconomics, such as supply-and-demand analysis. We address this gap by taxonomizing microeconomic reasoning into distinct elements, focusing on the logic of supply and demand, each grounded in up to distinct domains, perspectives, and types. The generation of benchmark data across this combinatorial space is powered by a novel LLM-assisted data generation protocol that we dub auto-STEER, which generates a set of questions by adapting…
Peer Reviews
Decision · Submitted to ICLR 2025
1. The paper offers a highly comprehensive benchmark, breaking down microeconomic reasoning into 57 distinct elements that span a wide range of domains and perspectives. 2. Significant resources and computational effort were invested in the study, with $5,896.33 spent on API requests to OpenAI and Anthropic, and 6.81 GPU years of compute used to evaluate open-source models. 3. All model outputs are made publicly available, promoting reproducibility and enabling further research and contributions
1. The primary contribution—expanding benchmarks to cover non-strategic microeconomic settings—leans more toward an economic contribution than a machine learning one, which may affect the paper's resonance with ICLR’s core audience. 2. The paper asserts that auto-STEER addresses data contamination but lacks empirical evidence or a detailed explanation of how it mitigates this problem effectively. 3. The benchmark is limited to multiple-choice questions, whereas real-world financial assistant LLM
1. The creation of a taxonomy with 57 elements specifically for non-strategic economics provides a strong theoretical basis for the benchmark. This comprehensive taxonomy ensures that the benchmark is grounded in a thorough understanding of microeconomic reasoning, enabling a detailed and structured evaluation of LLMs in this domain. 2. The benchmark is solid, due to its broad variation in testing angles and use of multiple testing metrics.
1. The paper fails to discuss the correlation between the results of the proposed benchmark and those of related benchmarks, which would help establish the significance of this research. 2. There are still some issues that need to be addressed in the writing of the paper, such as: - The paper lacks hyperlinks for tables, figures, and sections, making it difficult for readers to locate referenced content. - The title of Figure 2 contains misleading positional descriptors for the two images, referring to them a
1. This work identifies a gap in the evaluation of current LLMs for non-strategic microeconomic reasoning tasks with high practical needs and research value. 2. A structured STEER-ME benchmark and a dynamic data generation protocol auto-STEER are proposed to ensure test diversity and data cleanliness. 3. The experiments cover a wide range of models and adaptation strategies, revealing differences in the performance of different LLMs in non-strategic microeconomic reasoning.
1. In the AUTO-STEER section, the authors may need to clarify which type of LLM was used for data generation. Additionally, it is important to discuss how the choice of LLM for data generation affects the benchmark results. 2. The authors note that LLMs struggle with basic mathematical problems, like calculating the deadweight loss of a monopoly [Lines 445-450], and suggest that this may be attributed to the use of incorrect formulas. To strengthen this claim, the authors should present speci
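To make the deadweight-loss task raised in the last review concrete, here is a minimal sketch of the textbook calculation for a monopolist facing linear inverse demand P = a - bQ with constant marginal cost c. The function name, the linear-demand setup, and the example numbers are illustrative assumptions; they are not taken from the STEER-ME benchmark or its questions.

```python
def monopoly_deadweight_loss(a: float, b: float, c: float) -> float:
    """Deadweight loss of a monopoly (hypothetical helper, textbook setup).

    Assumes inverse demand P = a - b*Q and constant marginal cost c,
    with b > 0 and a > c so that trade is worthwhile.
    """
    if b <= 0 or a <= c:
        raise ValueError("require b > 0 and a > c")

    # Competitive benchmark: price equals marginal cost, a - b*Q = c.
    q_competitive = (a - c) / b

    # Monopoly: marginal revenue a - 2*b*Q equals marginal cost c.
    q_monopoly = (a - c) / (2 * b)
    p_monopoly = a - b * q_monopoly

    # The loss is the triangle between demand and marginal cost
    # over the units the monopolist withholds.
    return 0.5 * (p_monopoly - c) * (q_competitive - q_monopoly)


# Example with assumed values a=100, b=2, c=20.
dwl = monopoly_deadweight_loss(100, 2, 20)
print(dwl)
```

Under these assumptions the result also equals the closed form (a - c)^2 / (8b), which gives a quick check on any step-by-step answer.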
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Sparse Evolutionary Training · Focus
