Form 10-Q Itemization

Yanci Zhang; Tianming Du; Yujie Sun; Lawrence Donohue; Rui Dai

arXiv:2104.11783 · cs.IR · October 22, 2021

TL;DR

This paper introduces a hybrid approach that complements a rule-based algorithm with a CNN image classifier to itemize 10-Q filings. Because these filings lack a machine-readable hierarchy, itemization makes item-specific text retrievable and usable in NLP applications.

Contribution

The paper presents a pipeline that uses typographic features and a CNN image classifier to automate the itemization of 10-Q filings, addressing two obstacles: the lack of a machine-readable hierarchy and the wide variation in reporting formats.

Findings

01

Effective itemization of 10-Q filings demonstrated

02

Pipeline enables rapid data retrieval from textual data

03

Extracted data supports training transformer-based NLP models

Abstract

The quarterly financial statement, or Form 10-Q, is one of the most frequently required filings for US public companies to disclose financial and other important business information. Due to the massive volume of 10-Q filings and the enormous variations in the reporting format, it has been a long-standing challenge to retrieve item-specific information from 10-Q filings that lack machine-readable hierarchy. This paper presents a solution for itemizing 10-Q files by complementing a rule-based algorithm with a Convolutional Neural Network (CNN) image classifier. This solution demonstrates a pipeline that can be generalized to a rapid data retrieval solution among a large volume of textual data using only typographic items. The extracted textual data can be used as unlabeled content-specific data to train transformer models (e.g., BERT) or fit into various field-focus natural language…
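To make the rule-based half of the pipeline concrete, here is a minimal sketch of heading-driven itemization over plain text. It is an illustration under stated assumptions, not the authors' algorithm: the regular expressions, the `itemize` function, and the sample text are all hypothetical, and the CNN image classifier that the paper uses to complement the rules is not shown.

```python
import re

# Hypothetical rules (assumptions, not taken from the paper):
# a part heading looks like "PART I" or "PART II" at the start of a line,
# an item heading looks like "Item 1." or "Item 1A." at the start of a line.
PART_HEADING = re.compile(r"^\s*part\s+(i{1,2})\b", re.IGNORECASE)
ITEM_HEADING = re.compile(r"^\s*item\s+(\d+[a-z]?)\s*[.:\-]", re.IGNORECASE)


def itemize(text):
    """Split plain text into {(part, item): body} using heading rules only."""
    sections = {}
    part, key = None, None
    for line in text.splitlines():
        part_match = PART_HEADING.match(line)
        if part_match:
            # A new part starts; wait for its first item heading.
            part = part_match.group(1).upper()
            key = None
            continue
        item_match = ITEM_HEADING.match(line)
        if item_match:
            # A new item starts; the heading line itself is not kept in the body.
            key = (part, item_match.group(1).upper())
            sections[key] = []
            continue
        if key is not None:
            sections[key].append(line)
    return {k: "\n".join(v).strip() for k, v in sections.items()}


sample = """PART I
Item 1. Financial Statements
Revenue increased.
Item 2. Management's Discussion and Analysis
Liquidity remained stable.
PART II
Item 1A. Risk Factors
No material changes."""

sections = itemize(sample)
print(sections[("I", "2")])
```

Rules like these fail when a body sentence happens to begin with "Item 2." or when a filing formats its headings differently. The abstract's mention of enormous format variation suggests why a purely rule-based pass would need a complementary classifier.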
