STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models

Narun Raman; Taylor Lundy; Thiago Amin; Jesse Perla; Kevin Leyton-Brown

arXiv:2502.13119·cs.CL·February 20, 2025

TL;DR

This paper introduces STEER-ME, a comprehensive benchmark for evaluating large language models' microeconomic reasoning across diverse tasks, domains, and perspectives, using an innovative auto-STEER data generation protocol.

Contribution

It presents a novel taxonomy of microeconomic reasoning elements and an automated data generation method to create diverse, unbiased evaluation questions for LLMs.

Findings

1. Large models outperform smaller ones in microeconomic reasoning.
2. Prompting strategies significantly affect model performance.
3. Auto-STEER enables scalable and unbiased benchmark creation.

Abstract

How should one judge whether a given large language model (LLM) can reliably perform economic reasoning? Most existing LLM benchmarks focus on specific applications and fail to present the model with a rich variety of economic tasks. A notable exception is Raman et al. [2024], who offer an approach for comprehensively benchmarking strategic decision-making; however, this approach fails to address the non-strategic settings prevalent in microeconomics, such as supply-and-demand analysis. We address this gap by taxonomizing microeconomic reasoning into $58$ distinct elements, focusing on the logic of supply and demand, each grounded in up to $10$ distinct domains, $5$ perspectives, and $3$ types. The generation of benchmark data across this combinatorial space is powered by a novel LLM-assisted data generation protocol that we dub auto-STEER, which generates a set of questions by adapting…
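To give a sense of scale for the combinatorial space described in the abstract, the following is a minimal sketch that multiplies out the stated counts. Because the abstract says each element is grounded in "up to" that many domains, perspectives, and types, the product is only an upper bound, not the actual benchmark size. The function names and the placeholder labels (`e1`, `d1`, and so on) are hypothetical and are not taken from the paper or the auto-STEER code.

```python
from itertools import product

# Counts quoted in the abstract; each is an upper bound ("up to").
N_ELEMENTS = 58
MAX_DOMAINS = 10
MAX_PERSPECTIVES = 5
MAX_TYPES = 3


def max_cells(n_elements=N_ELEMENTS, n_domains=MAX_DOMAINS,
              n_perspectives=MAX_PERSPECTIVES, n_types=MAX_TYPES):
    """Upper bound on (element, domain, perspective, type) combinations."""
    return n_elements * n_domains * n_perspectives * n_types


def enumerate_cells(elements, domains, perspectives, types):
    """Iterate over every combination; the labels are hypothetical placeholders."""
    return product(elements, domains, perspectives, types)


upper_bound = max_cells()

# Tiny illustration with made-up labels: 2 elements x 1 domain x 2 perspectives x 1 type.
sample = list(enumerate_cells(["e1", "e2"], ["d1"], ["p1", "p2"], ["t1"]))

print(upper_bound)
print(len(sample))
```

The real number of populated cells is presumably smaller, since not every element need apply to every domain, perspective, or type.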

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

1. The paper offers a highly comprehensive benchmark, breaking down microeconomic reasoning into 57 distinct elements that span a wide range of domains and perspectives.
2. Significant resources and computational effort were invested in the study, with $5,896.33 spent on API requests to OpenAI and Anthropic, and 6.81 GPU years of compute used to evaluate open-source models.
3. All model outputs are made publicly available, promoting reproducibility and enabling further research and contributions

Weaknesses

1. The primary contribution, expanding benchmarks to cover non-strategic microeconomic settings, leans more toward an economic contribution than a machine learning one, which may affect the paper's resonance with ICLR’s core audience.
2. The paper asserts that auto-STEER addresses data contamination but lacks empirical evidence or a detailed explanation of how it mitigates this problem effectively.
3. The benchmark is limited to multiple-choice questions, whereas real-world financial assistant LLM

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The creation of a taxonomy with 57 elements specifically for non-strategic economics provides a strong theoretical basis for the benchmark. This comprehensive taxonomy ensures that the benchmark is grounded in a thorough understanding of microeconomic reasoning, enabling a detailed and structured evaluation of LLMs in this domain.
2. The benchmark is solid, due to its broad variation in testing angles and use of multiple testing metrics.

Weaknesses

1. The paper fails to discuss the correlation between the results of the proposed benchmark and those of related benchmarks, which would highlight the significance of this research.
2. Some issues in the writing of the paper still need to be addressed, such as:
   - The paper lacks hyperlinks for tables, figures, and sections, making it difficult for readers to locate referenced content.
   - The title of Figure 2 contains misleading positional descriptors for the two images, referring to them a

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. This work identifies a gap in the evaluation of current LLMs for non-strategic microeconomic reasoning tasks with high practical needs and research value.
2. A structured STEER-ME benchmark and a dynamic data generation protocol auto-STEER are proposed to ensure test diversity and data cleanliness.
3. The experiments cover a wide range of models and adaptation strategies, revealing differences in the performance of different LLMs in non-strategic microeconomic reasoning.

Weaknesses

1. In the AUTO-STEER section, the authors may need to clarify which type of LLM was used for data generation. Additionally, it is important to discuss how the choice of LLM used for data generation affects the benchmark results.
2. The authors note that LLMs struggle with basic mathematical problems, like calculating the deadweight loss of a monopoly [Lines 445-450], and suggest that this may be attributed to the use of incorrect formulas. To strengthen this claim, the authors should present speci

Code & Models

Datasets

narunraman/steer_me
dataset · 25 dl

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Sparse Evolutionary Training · Focus