TL;DR
PyFi introduces a pyramid-structured dataset (PyFi-600K) and a multi-agent adversarial synthesis mechanism (PyFi-adv) that let vision language models (VLMs) reason over financial images through progressive, simple-to-complex question chains.
Contribution
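The paper presents PyFi, which pairs a scalable dataset synthesized without human annotation with the adversarial challenger-solver mechanism that generates it, so that VLMs can reason through financial images hierarchically, from basic perception up to expert-level understanding.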
Findings
Fine-tuning improves model accuracy by up to 19.52%.
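The PyFi-600K dataset enables fine-grained evaluation of financial visual reasoning across capability levels.
Question chains generated adversarially, with a challenger agent competing against a solver agent, probe progressively deeper levels of reasoning.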
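To make the idea of level-wise evaluation concrete, here is a minimal sketch of how accuracy could be broken down by pyramid level. The record layout (a `level` integer paired with a correctness flag) and the function name `accuracy_by_level` are assumptions made for illustration; the sample records are hypothetical and are not results from the paper.

```python
from collections import defaultdict

def accuracy_by_level(records):
    """Return accuracy per pyramid level.

    records: iterable of (level, is_correct) pairs, where level 1 is the
    base of the pyramid (basic perception). This layout is an assumption.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for level, is_correct in records:
        totals[level] += 1
        correct[level] += int(is_correct)
    return {level: correct[level] / totals[level] for level in sorted(totals)}

# Hypothetical outcomes for illustration only, not taken from the paper.
records = [(1, True), (1, True), (2, True), (2, False), (3, False), (3, False)]
print(accuracy_by_level(records))
```

Reporting one number per level, rather than a single aggregate, shows where in the pyramid a model starts to fail.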
Abstract
This paper proposes PyFi, a novel framework for pyramid-like financial image understanding that enables vision language models (VLMs) to reason through question chains in a progressive, simple-to-complex manner. At the core of PyFi is PyFi-600K, a dataset comprising 600K financial question-answer pairs organized into a reasoning pyramid: questions at the base require only basic perception, while those toward the apex demand increasing levels of capability in financial visual understanding and expertise. This data is scalable because it is synthesized without human annotations, using PyFi-adv, a multi-agent adversarial mechanism under the Monte Carlo Tree Search (MCTS) paradigm, in which, for each image, a challenger agent competes with a solver agent by generating question chains that progressively probe deeper capability levels in financial visual reasoning. Leveraging this dataset, we…
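The challenger-solver interaction described above can be illustrated with a toy loop. This is a deliberately simplified sketch, not the paper's PyFi-adv: it replaces the MCTS search with a linear escalation of difficulty, and the agents (`toy_challenger`, `toy_solver`), the `skill` parameter, and the data classes are all hypothetical stand-ins for model-backed agents.

```python
from dataclasses import dataclass, field

@dataclass
class Question:
    level: int  # 1 = basic perception; higher = more financial expertise
    text: str

@dataclass
class Chain:
    image_id: str
    questions: list = field(default_factory=list)

def toy_challenger(image_id, level):
    # Hypothetical stand-in for an agent that writes a question at a given level.
    return Question(level=level, text=f"[{image_id}] level-{level} question")

def toy_solver(question, skill=3):
    # Hypothetical stand-in: the solver succeeds only up to its skill level.
    return question.level <= skill

def build_chain(image_id, max_level=5, skill=3):
    """Escalate difficulty until the solver fails or max_level is reached."""
    chain = Chain(image_id)
    for level in range(1, max_level + 1):
        question = toy_challenger(image_id, level)
        chain.questions.append(question)
        if not toy_solver(question, skill):
            break  # the challenger found a level the solver cannot handle
    return chain

chain = build_chain("chart_001")
print([q.level for q in chain.questions])
```

Each resulting chain runs from the base of the pyramid up to the first level the solver cannot answer. A tree search such as MCTS would instead explore several candidate questions per level and keep the most informative branches.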
