Utilizing Pre-trained and Large Language Models for 10-K Items Segmentation
Hsin-Min Lu, Yu-Tai Chien, Huan-Hsun Yen, and Yen-Hsiu Chen

TL;DR
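This paper introduces two methods for segmenting 10-K reports into items, GPT4ItemSeg and BERT4ItemSeg. Both outperform a rule-based baseline on core items, and BERT4ItemSeg also narrowly beats a conditional random field model, with the aim of improving text analytics in accounting and finance.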
Contribution
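It presents a line-ID-based prompting mechanism that lets a large language model segment 10-K items (GPT4ItemSeg), and a hierarchical model that combines BERT with a Bi-LSTM to work around context window constraints (BERT4ItemSeg).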
Findings
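BERT4ItemSeg achieves a macro-F1 of 0.9825 on core items (1, 1A, 3, and 7), ahead of conditional random field (0.9818), GPT4ItemSeg (0.9567), and rule-based methods (0.9048).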
GPT4ItemSeg adapts easily to regulatory changes.
The combined framework offers reliable, reproducible segmentation results.
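For readers unfamiliar with the metric behind these scores, here is a minimal sketch of macro-F1: the unweighted mean of per-label F1. It assumes one gold label and one predicted label per line of a report; the paper's exact evaluation granularity is not stated in this summary, and the function name `macro_f1` is illustrative only.

```python
from collections import defaultdict


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred.

    Assumption: gold and pred are equal-length sequences with one item
    label per line (for example "1", "1A", "3", "7").
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp = defaultdict(int)  # true positives per label
    fp = defaultdict(int)  # false positives per label
    fn = defaultdict(int)  # false negatives per label
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1

    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = ["1", "1", "1A", "7"]
pred = ["1", "1A", "1A", "7"]
score = macro_f1(gold, pred)
```

Because every label counts equally, a short item that is segmented badly lowers the score as much as a long one would.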
Abstract
Extracting specific items from 10-K reports is challenging due to variations in document formats and item presentation. To improve over traditional rule-based approaches, this study introduces and compares two advanced item segmentation methods: (1) GPT4ItemSeg, using a novel line-ID-based prompting mechanism to utilize a large language model, ChatGPT-4o, for item segmentation, and (2) BERT4ItemSeg, combining a pre-trained language model, BERT, with a Bi-LSTM model in a hierarchical structure to overcome context window constraints. Trained and evaluated on 3,737 annotated 10-K reports, BERT4ItemSeg achieves a macro-F1 of 0.9825, surpassing GPT4ItemSeg (0.9567), conditional random field (0.9818), and rule-based methods (0.9048) for core items (1, 1A, 3, and 7). These approaches enhance item segmentation performance, improving text analytics in accounting and finance. BERT4ItemSeg offers…
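The line-ID-based prompting idea in the abstract can be illustrated with a short sketch. It is not the authors' implementation: the prompt wording, the `ITEM <name>: <line ID>` reply format, and all function names are assumptions made for illustration, and no model is actually called. The sketch numbers each line, asks for the first line ID of each item, and slices the report using the IDs in the reply.

```python
import re


def add_line_ids(lines):
    """Prefix every line with a bracketed line ID, e.g. "[3] Item 1A. ..."."""
    return "\n".join(f"[{i}] {line}" for i, line in enumerate(lines))


def build_prompt(lines):
    """Assemble a hypothetical prompt; the wording is an assumption."""
    return (
        "Below is a 10-K report with a line ID in brackets before each line.\n"
        "For each item, reply with one row 'ITEM <name>: <first line ID>'.\n\n"
        + add_line_ids(lines)
    )


def parse_starts(reply):
    """Read 'ITEM <name>: <line ID>' rows from a (hypothetical) model reply."""
    return {m.group(1): int(m.group(2))
            for m in re.finditer(r"ITEM\s+(\w+):\s*(\d+)", reply)}


def segment(lines, starts):
    """Slice the report so each item runs from its start line to the next start."""
    ordered = sorted(starts.items(), key=lambda kv: kv[1])
    segments = {}
    for idx, (item, start) in enumerate(ordered):
        end = ordered[idx + 1][1] if idx + 1 < len(ordered) else len(lines)
        segments[item] = lines[start:end]
    return segments


lines = [
    "Cover page",
    "Item 1. Business",
    "We sell things.",
    "Item 1A. Risk Factors",
    "Risks exist.",
]
prompt = build_prompt(lines)
reply = "ITEM 1: 1\nITEM 1A: 3"  # stand-in for a model response
segments = segment(lines, parse_starts(reply))
```

A plausible benefit of this design is that the model returns only short line IDs rather than reproducing item text, so the extracted segments are copied verbatim from the source report.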
