Utilizing Pre-trained and Large Language Models for 10-K Items Segmentation

Hsin-Min Lu, Yu-Tai Chien, Huan-Hsun Yen, and Yen-Hsiu Chen

arXiv:2502.08875·q-fin.GN·April 9, 2026

TL;DR

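This paper introduces two methods for segmenting 10-K reports into their items: GPT4ItemSeg, which prompts a large language model using line IDs, and BERT4ItemSeg, a hierarchical BERT and Bi-LSTM model. Both outperform rule-based segmentation on core items, which supports more reliable financial text analytics.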

Contribution

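The paper presents a line-ID-based prompting mechanism for large language models and a hierarchical BERT plus Bi-LSTM model that works around context window limits, and it compares both against rule-based and conditional random field baselines for 10-K item segmentation.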

Findings

01

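BERT4ItemSeg achieves a macro-F1 of 0.9825 on core items (1, 1A, 3, and 7), ahead of a conditional random field (0.9818), GPT4ItemSeg (0.9567), and rule-based methods (0.9048).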

02

GPT4ItemSeg adapts easily to regulatory changes.

03

The combined framework offers reliable, reproducible segmentation results.
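
For readers unfamiliar with the metric in finding 01, macro-F1 is the unweighted mean of the per-class F1 scores, so a rarely occurring item counts as much as a frequent one. The sketch below is a minimal pure-Python illustration. It assumes one item label per line, which is an assumption of this example; the summary does not state the unit at which the paper scores predictions, and the function name `macro_f1` is hypothetical.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred.

    Assumes gold and pred are equal-length sequences with one label per
    line (an assumption of this sketch, not a detail taken from the paper).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it was wrong
            fn[g] += 1  # missed the true label g

    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: four lines, one mislabeled as Item 1A instead of Item 1.
gold_labels = ["1", "1", "1A", "7"]
pred_labels = ["1", "1A", "1A", "7"]
score = macro_f1(gold_labels, pred_labels)
```

In the toy example, Items 1 and 1A each score 2/3 and Item 7 scores 1, so the macro average is 7/9.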

Abstract

Extracting specific items from 10-K reports is challenging due to variations in document formats and item presentation. To improve over traditional rule-based approaches, this study introduces and compares two advanced item segmentation methods: (1) GPT4ItemSeg, using a novel line-ID-based prompting mechanism to utilize a large language model, ChatGPT-4o, for item segmentation, and (2) BERT4ItemSeg, combining a pre-trained language model, BERT, with a Bi-LSTM model in a hierarchical structure to overcome context window constraints. Trained and evaluated on 3,737 annotated 10-K reports, BERT4ItemSeg achieves a macro-F1 of 0.9825, surpassing GPT4ItemSeg (0.9567), conditional random field (0.9818), and rule-based methods (0.9048) for core items (1, 1A, 3, and 7). These approaches enhance item segmentation performance, improving text analytics in accounting and finance. BERT4ItemSeg offers…
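
The line-ID idea described in the abstract can be sketched as follows: each line of the report is prefixed with an ID, the model is asked to return only the ID at which each item starts, and the item text is then recovered by slicing the original lines. This is a minimal sketch under stated assumptions, not the authors' implementation. The prompt wording, the JSON reply format, and the function names (`number_lines`, `build_prompt`, `split_by_start_ids`) are all hypothetical, and the model call is replaced by a hard-coded reply so the example runs offline.

```python
import json


def number_lines(lines):
    """Prefix every line with a 1-based ID such as "[3] ..."."""
    return "\n".join(f"[{i}] {text}" for i, text in enumerate(lines, start=1))


def build_prompt(lines, items=("1", "1A", "3", "7")):
    """Assemble a prompt asking for start-line IDs only (hypothetical wording)."""
    return (
        "Below is a 10-K report with a line ID in brackets before each line.\n"
        f"For each of these items: {', '.join(items)}, return the ID of the line "
        "where the item begins, as a JSON object mapping item to line ID. "
        "Omit items that do not appear.\n\n" + number_lines(lines)
    )


def split_by_start_ids(lines, starts):
    """Slice lines into items; each item runs until the next item's start line."""
    ordered = sorted(starts.items(), key=lambda kv: kv[1])
    segments = {}
    for idx, (item, start) in enumerate(ordered):
        if idx + 1 < len(ordered):
            end = ordered[idx + 1][1] - 1  # last line before the next item
        else:
            end = len(lines)  # final item runs to the end of the document
        segments[item] = lines[start - 1:end]
    return segments


report = [
    "PART I",
    "Item 1. Business",
    "We make widgets.",
    "Item 1A. Risk Factors",
    "Demand may fall.",
]
prompt = build_prompt(report)

# Stand-in for the model's reply; a real system would send `prompt` to an LLM.
reply = '{"1": 2, "1A": 4}'
segments = split_by_start_ids(report, json.loads(reply))
```

Because the model returns only line IDs rather than the item text itself, the output stays short and the extracted text is copied verbatim from the source lines instead of being regenerated.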
