Notes on Applicability of GPT-4 to Document Understanding
{\L}ukasz Borchmann

TL;DR
This paper evaluates GPT-4 models on document understanding tasks, highlighting their strengths with multimodal inputs and identifying limitations such as performance drops on lengthy documents and potential model contamination.
Contribution
It provides a reproducible benchmark of GPT-4 models for document understanding, emphasizing the importance of multimodal inputs and analyzing model limitations.
Findings
GPT-4 Vision Turbo performs well with OCR and images.
Text-only GPT-4 models face challenges in document comprehension.
Performance drops significantly on lengthy documents.
Abstract
We perform a missing, reproducible evaluation of all publicly available GPT-4 family models concerning the Document Understanding field, where it is frequently required to comprehend text spacial arrangement and visual clues in addition to textual semantics. Benchmark results indicate that though it is hard to achieve satisfactory results with text-only models, GPT-4 Vision Turbo performs well when one provides both text recognized by an external OCR engine and document images on the input. Evaluation is followed by analyses that suggest possible contamination of textual GPT-4 models and indicate the significant performance drop for lengthy documents.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMathematics, Computing, and Information Processing
MethodsAttention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections
