FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks

Yupeng Cao; Haohang Li; Weijin Liu; Wenbo Cao; Anke Xu; Lingfei Qian; Xueqing Peng; Minxue Tang; Zhiyuan Yao; Jimin Huang; K.P. Subbalakshmi; Zining Zhu; Jordan W. Suchow; Yangyang Yu

arXiv:2604.10015 · cs.AI · April 16, 2026

TL;DR

FinTrace introduces a comprehensive benchmark and training dataset for evaluating and improving large language models' ability to perform long-horizon financial tasks through tool calling, emphasizing trajectory-level reasoning.

Contribution

The paper presents FinTrace, a trajectory-level benchmark and dataset for financial tool calling, along with fine-tuning methods that improve intermediate reasoning but leave final output quality as an open challenge.

Findings

1. Frontier models excel at tool selection but struggle with information utilization.

2. FinTrace-Training improves intermediate reasoning metrics.

3. DPO training better suppresses failure modes than supervised fine-tuning.

Abstract

Recent studies demonstrate that tool-calling capability enables large language models (LLMs) to interact with external environments for long-horizon financial tasks. While existing benchmarks have begun evaluating financial tool calling, they focus on limited scenarios and rely on call-level metrics that fail to capture trajectory-level reasoning quality. To address this gap, we introduce FinTrace, a benchmark comprising 800 expert-annotated trajectories spanning 34 real-world financial task categories across multiple difficulty levels. FinTrace employs a rubric-based evaluation protocol with nine metrics organized along four axes (action correctness, execution efficiency, process quality, and output quality), enabling fine-grained assessment of LLM tool-calling behavior. Our evaluation of 13 LLMs reveals that while frontier models achieve strong tool selection, all models struggle…
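
To make the four-axis rubric concrete, the sketch below groups the per-metric scores of one trajectory by axis and averages them. It is only an illustration under stated assumptions: the axis names come from the abstract, but the `RubricScore` type, the `axis_scores` helper, the placeholder metric names, the [0, 1] score range, and the use of a plain mean are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean

# The four axes are named in the abstract. Everything else here is assumed.
AXES = (
    "action_correctness",
    "execution_efficiency",
    "process_quality",
    "output_quality",
)


@dataclass
class RubricScore:
    axis: str     # one of AXES
    metric: str   # hypothetical metric name (the abstract does not list the nine metrics)
    score: float  # assumed to be normalized to [0, 1]


def axis_scores(scores):
    """Average the rubric scores of a single trajectory within each axis."""
    grouped = {axis: [] for axis in AXES}
    for s in scores:
        if s.axis not in grouped:
            raise ValueError(f"unknown axis: {s.axis}")
        grouped[s.axis].append(s.score)
    # An axis with no scored metric is reported as None rather than 0.
    return {axis: (mean(vals) if vals else None) for axis, vals in grouped.items()}


# Usage with placeholder metrics for one trajectory.
trajectory = [
    RubricScore("action_correctness", "placeholder_metric_a", 1.0),
    RubricScore("action_correctness", "placeholder_metric_b", 0.5),
    RubricScore("output_quality", "placeholder_metric_c", 0.25),
]
result = axis_scores(trajectory)
```

Reporting one score per axis, instead of a single per-call accuracy, is one way to surface the pattern the findings describe, where a model can score well on action correctness while scoring poorly on output quality.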
