AMALGUM -- A Free, Balanced, Multilayer English Web Corpus

Luke Gessler; Siyao Peng; Yang Liu; Yilun Zhu; Shabnam Behzad; Amir Zeldes

arXiv:2006.10677 · cs.CL · June 19, 2020 · 5 cites

TL;DR

AMALGUM is a freely available, genre-balanced English web corpus of 4M tokens with multiple high-quality automatic annotation layers, intended as a more sizable alternative to smaller, manually annotated data sets.

Contribution

It introduces a sizable, open web corpus with multiple annotation layers, avoiding the imbalanced or unknown composition, licensing problems, and low-quality automatic processing found in existing resources.

Findings

01

Harnesses knowledge from multiple annotation layers to reach a "better than NLP" benchmark, and evaluates the accuracy of the resulting resource.

02

Provides high-quality automatic annotations, including dependency trees, non-named entity annotations, coreference resolution, and discourse trees in Rhetorical Structure Theory.

03

Offers a genre-balanced, 4M-token corpus as a more sizable alternative to smaller, manually created annotated data sets.

Abstract

We present a freely available, genre-balanced English web corpus totaling 4M tokens and featuring a large number of high-quality automatic annotation layers, including dependency trees, non-named entity annotations, coreference resolution, and discourse trees in Rhetorical Structure Theory. By tapping open online data sources the corpus is meant to offer a more sizable alternative to smaller manually created annotated data sets, while avoiding pitfalls such as imbalanced or unknown composition, licensing problems, and low-quality natural language processing. We harness knowledge from multiple annotation layers in order to achieve a "better than NLP" benchmark and evaluate the accuracy of the resulting resource.

Code & Models

Repositories

gucorpling/amalgum
Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling