On the Comprehensibility of Multi-structured Financial Documents using LLMs and Pre-processing Tools

Shivani Upadhyay; Messiah Ataey; Syed Shariyar Murtaza; Yifan Nie; Jimmy Lin

arXiv:2506.05182 · cs.IR · August 22, 2025

TL;DR

This paper evaluates how well LLMs and MLLMs understand complex, multi-structured financial documents in PDF form, and shows that pre-processing tools improve answer accuracy while lowering overall cost.

Contribution

It introduces a pre-processing pipeline that enhances LLM and MLLM comprehension of complex financial data structures, improving accuracy and reducing costs.

Findings

01. GPT-4o achieves 56% accuracy when documents are fed to it directly.

02. With pre-processing, GPT-4o accuracy rises to 61.3%.

03. With pre-processing, GPT-4 reaches 76% accuracy.
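To put the three figures side by side, the short Python sketch below computes the gain of each pre-processed configuration over the 56% direct-input GPT-4o baseline. The accuracy values are the ones listed above; the helper name `gains` and the choice to compare GPT-4 against the GPT-4o baseline are assumptions made for illustration, since no direct-input baseline for GPT-4 is reported here.

```python
# Accuracy figures (percent) as listed in the findings above.
BASELINE = 56.0  # GPT-4o, documents fed directly
PREPROCESSED = {"GPT-4o": 61.3, "GPT-4": 76.0}  # with pre-processing tools


def gains(baseline: float, accuracy: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = accuracy - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)


# Hypothetical comparison: every configuration is measured against the
# GPT-4o direct-input baseline, which mixes models in the GPT-4 case.
results = {model: gains(BASELINE, acc) for model, acc in PREPROCESSED.items()}

for model, (abs_pts, rel_pct) in results.items():
    print(f"{model}: +{abs_pts} points ({rel_pct}% relative to baseline)")
```

Under these assumptions the gain is 5.3 points for GPT-4o and 20.0 points for GPT-4. The GPT-4 figure is a cross-model comparison, not a like-for-like ablation.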

Abstract

The proliferation of complex structured data in hybrid sources, such as PDF documents and web pages, presents unique challenges for current Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs) in providing accurate answers. Despite the recent advancements of MLLMs, they still often falter when interpreting intricately structured information, such as nested tables and multi-dimensional plots, leading to hallucinations and erroneous outputs. This paper explores the capabilities of LLMs and MLLMs in understanding and answering questions from complex data structures found in PDF documents by leveraging industrial and open-source tools as part of a pre-processing pipeline. Our findings indicate that GPT-4o, a popular MLLM, achieves an accuracy of 56% on multi-structured documents when fed documents directly, and that integrating pre-processing tools raises the accuracy…

Code & Models

Repositories

ogcds/financialqa (Official)

Taxonomy

Topics: Computational and Text Analysis Methods · Sentiment Analysis and Opinion Mining · Stock Market Forecasting Methods

Methods: Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Dropout · Dense Connections · Transformer · GPT-4