Constructing a Portfolio Optimization Benchmark Framework for Evaluating Large Language Models
Hanyong Cho, Jang Ho Kim

TL;DR
This paper develops a benchmark framework that evaluates how well large language models solve portfolio optimization problems with mathematically explicit solutions, as a test of their reasoning in financial decision-making.
Contribution
The paper introduces a benchmark that assesses LLMs' optimization-based reasoning in finance, rather than the language-processing tasks that existing financial benchmarks emphasize.
Findings
GPT-4 achieves the highest accuracy on risk-based objectives and remains stable under constraints
Gemini 1.5 Pro performs well on return-based tasks
Llama 3.1-70B shows the lowest overall performance
Abstract
This study introduces a benchmark framework for evaluating the financial decision-making capabilities of large language models (LLMs) through portfolio optimization problems with mathematically explicit solutions. Unlike existing financial benchmarks that emphasize language-processing tasks, the proposed framework directly tests optimization-based reasoning in investment contexts. A large set of multiple-choice questions is generated by varying objectives, candidate assets, and investment constraints, with each problem designed to include a unique correct solution and systematically constructed alternatives. Experimental results comparing GPT-4, Gemini 1.5 Pro, and Llama 3.1-70B reveal distinct performance patterns: GPT achieves the highest accuracy in risk-based objectives and remains stable under constraints, Gemini performs well in return-based tasks but struggles under other…
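The question-generation idea in the abstract (one mathematically explicit correct answer plus systematically constructed alternatives) can be illustrated with a minimal sketch. This is not the authors' generator: the two-asset minimum-variance objective, the function names, the fixed weight offsets used to build distractors, and the four-choice layout are all assumptions made for illustration. The sketch also leaves weights unconstrained, so some choices may imply short positions.

```python
import random


def portfolio_variance(w1, s1, s2, rho):
    """Variance of a two-asset portfolio holding w1 in asset 1 and 1 - w1 in asset 2."""
    w2 = 1.0 - w1
    return (w1 * s1) ** 2 + (w2 * s2) ** 2 + 2.0 * w1 * w2 * rho * s1 * s2


def min_variance_weight(s1, s2, rho):
    """Closed-form weight in asset 1 that minimizes two-asset portfolio variance."""
    denom = s1 ** 2 + s2 ** 2 - 2.0 * rho * s1 * s2
    if denom <= 0:
        # Happens only when both assets are identical in risk and perfectly correlated.
        raise ValueError("degenerate inputs: no unique minimizer")
    return (s2 ** 2 - rho * s1 * s2) / denom


def make_question(s1, s2, rho, offsets=(-0.2, 0.1, 0.3), seed=0):
    """Build one multiple-choice item (hypothetical format, assumed for this sketch).

    The correct choice is the closed-form optimum; each distractor shifts the
    optimal weight by a fixed nonzero offset, so it has strictly higher variance.
    """
    w_star = min_variance_weight(s1, s2, rho)
    weights = [w_star] + [w_star + d for d in offsets]
    random.Random(seed).shuffle(weights)
    return {
        "prompt": "Which allocation (asset 1, asset 2) minimizes portfolio variance?",
        "choices": [(round(w, 4), round(1.0 - w, 4)) for w in weights],
        "answer": weights.index(w_star),
    }


if __name__ == "__main__":
    question = make_question(s1=0.2, s2=0.1, rho=0.0)
    print(question["prompt"])
    for i, choice in enumerate(question["choices"]):
        print(i, choice)
    print("correct index:", question["answer"])
```

Because two-asset variance is a strictly convex quadratic in the weight whenever the denominator is positive, the closed-form optimum is the only minimizer, which is what makes the correct choice unique in this sketch.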
Taxonomy
Topics: Stock Market Forecasting Methods · Explainable Artificial Intelligence (XAI) · Financial Distress and Bankruptcy Prediction
