EDGAR-CORPUS: Billions of Tokens Make The World Go Round
Lefteris Loukas, Manos Fergadiotis, Ion Androutsopoulos, Prodromos Malakasiotis

TL;DR
EDGAR-CORPUS is the largest financial NLP corpus to date, built from US annual reports. It supports training domain-specific embeddings for financial NLP tasks and comes with an open-source toolkit for collecting future reports.
Contribution
The paper introduces EDGAR-CORPUS, the largest financial report corpus, along with the EDGAR-W2V embeddings and the EDGAR-CRAWLER toolkit for downloading and extracting reports.
Findings
EDGAR-W2V embeddings outperform generic GloVe embeddings and existing financial word embeddings on financial NLP tasks.
EDGAR-CORPUS covers over 25 years of US annual reports.
The open-source EDGAR-CRAWLER toolkit facilitates downloading and extracting future annual reports.
Abstract
We release EDGAR-CORPUS, a novel corpus comprising annual reports from all the publicly traded companies in the US spanning a period of more than 25 years. To the best of our knowledge, EDGAR-CORPUS is the largest financial NLP corpus available to date. All the reports are downloaded, split into their corresponding items (sections), and provided in a clean, easy-to-use JSON format. We use EDGAR-CORPUS to train and release EDGAR-W2V, which are WORD2VEC embeddings for the financial domain. We employ these embeddings in a battery of financial NLP tasks and showcase their superiority over generic GloVe embeddings and other existing financial word embeddings. We also open-source EDGAR-CRAWLER, a toolkit that facilitates downloading and extracting future annual reports.
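The abstract describes reports split into items and stored as JSON. A minimal sketch of how such a record might be read and tokenized before embedding training follows. The record below is made up for illustration, and the key names ("filename", "year", "item_1", "item_1A", "item_7") are assumptions rather than the corpus's documented schema; the tokenizer is a simple stand-in, not the authors' preprocessing.

```python
import json
import re
from collections import Counter

# Hypothetical report record. The key names and text are illustrative
# assumptions, not taken from the actual corpus.
SAMPLE_REPORT = json.dumps({
    "filename": "example_10K_2020.htm",
    "year": "2020",
    "item_1": "The Company designs and sells widgets.",
    "item_1A": "Our business is subject to risks. Risks include competition.",
    "item_7": "Revenue increased due to higher widget sales.",
})

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def tokenize_items(report):
    """Map each key starting with 'item_' to its list of tokens."""
    return {
        key: tokenize(value)
        for key, value in report.items()
        if key.startswith("item_")
    }


report = json.loads(SAMPLE_REPORT)
items = tokenize_items(report)

# Corpus-level statistics over all item sections of this one report.
token_counts = Counter(tok for toks in items.values() for tok in toks)
total_tokens = sum(len(toks) for toks in items.values())

for key, toks in items.items():
    print(key, len(toks))
print("total tokens:", total_tokens)
```

Per-item token lists of this kind are the usual input to a word-embedding trainer, so the same loop applied across many reports would yield a training stream for WORD2VEC-style embeddings.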
Taxonomy
Methods: GloVe Embeddings
