PAWLS: PDF Annotation With Labels and Structure

Mark Neumann; Zejiang Shen; Sam Skjonsberg

arXiv:2101.10281 · cs.CL · January 26, 2021 · 1 citation


TL;DR

PAWLS is a specialized annotation tool for PDFs that enables detailed, multi-modal annotations including text spans, relations, and non-textual elements, facilitating NLP and multi-modal model training.

Contribution

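The paper introduces PAWLS, an annotation tool built specifically for PDF documents. It supports span-based textual annotation, N-ary relations, and freeform non-textual bounding boxes, and it targets tasks in which annotators need extended context to label accurately.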

Findings

01. Supports span-based textual annotation, N-ary relations, and freeform bounding boxes

02. Exports annotations in formats suitable for training multi-modal machine learning models

03. Suited to scenarios in which annotators need extended context to annotate accurately

Abstract

Adobe's Portable Document Format (PDF) is a popular way of distributing view-only documents with a rich visual markup. This presents a challenge to NLP practitioners who wish to use the information contained within PDF documents for training models or data analysis, because annotating these documents is difficult. In this paper, we present PDF Annotation with Labels and Structure (PAWLS), a new annotation tool designed specifically for the PDF document format. PAWLS is particularly suited for mixed-mode annotation and scenarios in which annotators require extended context to annotate accurately. PAWLS supports span-based textual annotation, N-ary relations and freeform, non-textual bounding boxes, all of which can be exported in convenient formats for training multi-modal machine learning models. A read-only PAWLS server is available at https://pawls.apps.allenai.org/ and the source…
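The three annotation types named in the abstract can be pictured with a small data model. The sketch below is illustrative only: the class names, field names, and JSON layout are assumptions made for this example and are not the PAWLS export schema. It represents a textual span as a box plus the tokens it covers, a freeform region as a box with no tokens, and an N-ary relation as a label over any number of annotation ids.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class BoundingBox:
    # Hypothetical page-relative rectangle; units are left unspecified here.
    page: int
    left: float
    top: float
    width: float
    height: float


@dataclass
class Annotation:
    # One labeled region. `tokens` stays empty for a freeform, non-textual box.
    id: str
    label: str
    box: BoundingBox
    tokens: List[str] = field(default_factory=list)

    @property
    def is_textual(self) -> bool:
        return len(self.tokens) > 0


@dataclass
class Relation:
    # N-ary relation: a label attached to any number of annotation ids.
    label: str
    members: List[str]


def export(annotations: List[Annotation], relations: List[Relation]) -> str:
    """Serialize annotations and relations to JSON (assumed layout)."""
    known = {a.id for a in annotations}
    for relation in relations:
        missing = [m for m in relation.members if m not in known]
        if missing:
            raise ValueError(f"relation {relation.label!r} references unknown ids: {missing}")
    payload = {
        "annotations": [asdict(a) for a in annotations],
        "relations": [asdict(r) for r in relations],
    }
    return json.dumps(payload, indent=2)
```

Keeping the bounding box on every annotation, textual or not, means a downstream multi-modal model can consume layout and text from the same record; the relation check simply rejects exports that point at annotations that do not exist.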


Code & Models

Repositories

allenai/pawls (official)


Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Text Analysis Techniques