AMALGUM -- A Free, Balanced, Multilayer English Web Corpus
Luke Gessler, Siyao Peng, Yang Liu, Yilun Zhu, Shabnam Behzad, Amir Zeldes

TL;DR
AMALGUM is a freely available, genre-balanced English web corpus of 4M tokens with multiple high-quality automatic annotation layers, designed as a more sizable alternative to smaller, manually annotated datasets.
Contribution
It introduces a sizable, open web corpus with multiple annotation layers, addressing issues of imbalance, licensing, and quality in existing resources.
Findings
Combines knowledge from multiple annotation layers to reach a "better than NLP" benchmark, and evaluates the accuracy of the resulting resource.
Provides high-quality automatic annotations including dependency trees and discourse structures.
Offers a balanced, large-scale corpus as an alternative to smaller datasets.
Abstract
We present a freely available, genre-balanced English web corpus totaling 4M tokens and featuring a large number of high-quality automatic annotation layers, including dependency trees, non-named entity annotations, coreference resolution, and discourse trees in Rhetorical Structure Theory. By tapping open online data sources the corpus is meant to offer a more sizable alternative to smaller manually created annotated data sets, while avoiding pitfalls such as imbalanced or unknown composition, licensing problems, and low-quality natural language processing. We harness knowledge from multiple annotation layers in order to achieve a "better than NLP" benchmark and evaluate the accuracy of the resulting resource.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling
