Time Series Augmented Generation for Financial Applications
Anton Kolonin, Alexey Glushchenko, Evgeny Bochkov, Abhishek Saxena

TL;DR
This paper introduces a new evaluation framework and benchmark for assessing how well large language models reason about financial time-series analysis, with an emphasis on tool-use accuracy and reliability.
Contribution
The paper presents a novel methodology and benchmark for measuring LLM reasoning in finance, along with empirical insights from large-scale experiments.
Findings
Capable agents achieve near-perfect tool-use accuracy.
High-performing agents exhibit minimal hallucination.
The benchmark and framework are publicly released for research use.
Abstract
Evaluating the reasoning capabilities of Large Language Models (LLMs) for complex, quantitative financial tasks is a critical and unsolved challenge. Standard benchmarks often fail to isolate an agent's core ability to parse queries and orchestrate computations. To address this, we introduce a novel evaluation methodology and benchmark designed to rigorously measure an LLM agent's reasoning for financial time-series analysis. We apply this methodology in a large-scale empirical study using our framework, Time Series Augmented Generation (TSAG), where an LLM agent delegates quantitative tasks to verifiable, external tools. Our benchmark, consisting of 100 financial questions, is used to compare multiple SOTA agents (e.g., GPT-4o, Llama 3, Qwen2) on metrics assessing tool selection accuracy, faithfulness, and hallucination. The results demonstrate that capable agents can achieve…
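The delegation pattern the abstract describes, in which the agent picks a tool and a deterministic function does the arithmetic, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the tool names (`moving_average`, `simple_return`), the `ToolCall` structure, and the exact-match definition of tool selection accuracy do not come from the paper, whose actual tool set and metric definitions are not given in this summary.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


# Hypothetical tools: names and signatures are illustrative, not the paper's.
def moving_average(prices: List[float], window: int) -> float:
    """Mean of the last `window` prices."""
    return mean(prices[-window:])


def simple_return(prices: List[float]) -> float:
    """Fractional change from the first price to the last."""
    return prices[-1] / prices[0] - 1.0


TOOLS: Dict[str, Callable[..., float]] = {
    "moving_average": moving_average,
    "simple_return": simple_return,
}


@dataclass
class ToolCall:
    """A tool choice as an agent might emit it (assumed shape)."""
    name: str
    kwargs: dict


def execute(call: ToolCall) -> float:
    """Run the selected tool; the numeric result never comes from the LLM."""
    return TOOLS[call.name](**call.kwargs)


def tool_selection_accuracy(predicted: List[str], expected: List[str]) -> float:
    """Share of questions where the chosen tool matches the reference tool.

    Assumed definition (exact name match), not the paper's formula.
    """
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


prices = [100.0, 102.0, 101.0, 105.0]
result = execute(ToolCall("simple_return", {"prices": prices}))
accuracy = tool_selection_accuracy(
    ["simple_return", "moving_average"],
    ["simple_return", "simple_return"],
)
```

Because the computation lives in plain functions, each answer can be checked independently of the model, which is what makes the tool outputs verifiable in this kind of setup.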
