EDGAR-CORPUS: Billions of Tokens Make The World Go Round

Lefteris Loukas; Manos Fergadiotis; Ion Androutsopoulos; Prodromos Malakasiotis

arXiv:2109.14394·cs.CL·May 31, 2023

TL;DR

EDGAR-CORPUS collects more than 25 years of US annual reports and is, to the best of the authors' knowledge, the largest financial NLP corpus to date. It is released with domain-specific word embeddings and an open-source toolkit for data collection.

Contribution

The paper introduces EDGAR-CORPUS, a corpus of US annual reports split into items and released as JSON, along with the EDGAR-W2V word embeddings and the EDGAR-CRAWLER toolkit for downloading and extracting future reports.

Findings

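01. EDGAR-W2V embeddings outperform generic GloVe embeddings and existing financial word embeddings on a battery of financial NLP tasks.

02. EDGAR-CORPUS covers more than 25 years of annual reports from publicly traded US companies.

03. The open-source EDGAR-CRAWLER toolkit facilitates downloading and extracting future annual reports.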

Abstract

We release EDGAR-CORPUS, a novel corpus comprising annual reports from all the publicly traded companies in the US spanning a period of more than 25 years. To the best of our knowledge, EDGAR-CORPUS is the largest financial NLP corpus available to date. All the reports are downloaded, split into their corresponding items (sections), and provided in a clean, easy-to-use JSON format. We use EDGAR-CORPUS to train and release EDGAR-W2V, which are WORD2VEC embeddings for the financial domain. We employ these embeddings in a battery of financial NLP tasks and showcase their superiority over generic GloVe embeddings and other existing financial word embeddings. We also open-source EDGAR-CRAWLER, a toolkit that facilitates downloading and extracting future annual reports.
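The abstract says each report is split into items and stored as JSON. As a rough illustration of working with such a file, the sketch below parses one report and counts tokens per item. The key names (`cik`, `year`, `item_1`, `item_7`), the sample text, and the helper `item_token_counts` are assumptions made for this example and are not taken from the paper; the released files may be laid out differently.

```python
import json
import re

# Hypothetical report: key names and text are assumptions for illustration only.
SAMPLE_REPORT = json.dumps({
    "cik": "0000000",
    "year": "2020",
    "item_1": "The Company designs and sells widgets.",
    "item_7": "Revenue increased due to higher widget sales.",
})


def item_token_counts(raw_json):
    """Return {item_key: token_count} for every key that starts with 'item_'."""
    report = json.loads(raw_json)
    counts = {}
    for key, text in report.items():
        if key.startswith("item_"):
            # Crude lowercase word tokenisation; a real pipeline would
            # use a proper tokenizer before training embeddings.
            counts[key] = len(re.findall(r"[a-z0-9]+", text.lower()))
    return counts


counts = item_token_counts(SAMPLE_REPORT)
print(counts)
```

Per-item token lists produced this way could then be fed to a Word2Vec trainer, which is the kind of use the EDGAR-W2V embeddings represent.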


Taxonomy

Methods: GloVe Embeddings