A Formal Framework for Linguistic Annotation
Steven Bird, Mark Liberman (University of Pennsylvania)

TL;DR
This paper introduces a formal framework called annotation graphs for representing and managing linguistic annotations across various data formats, addressing the lack of standardization in the field.
Contribution
It proposes a unified logical structure for linguistic annotations, enabling interoperability and consistency across different formats and tools.
Findings
Demonstrates a common conceptual core among diverse annotation formats.
Provides a formal framework for constructing and searching annotations.
Remains consistent with many alternative data structures and file formats.
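
To make the idea of constructing and searching annotations concrete, here is a minimal sketch. It assumes an annotation graph can be read as a labeled directed graph whose nodes may carry time offsets into the signal and whose arcs carry an annotation type and a label. The class names, method names, tier names ("word", "phone") and the sample times are all hypothetical illustrations, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Node:
    id: int
    # Offset in seconds into the recording; None when the time is unknown.
    time: Optional[float] = None


@dataclass(frozen=True)
class Arc:
    src: int
    dst: int
    type: str   # annotation tier (hypothetical names such as "word", "phone")
    label: str  # the annotation content itself


@dataclass
class AnnotationGraph:
    nodes: Dict[int, Node] = field(default_factory=dict)
    arcs: List[Arc] = field(default_factory=list)

    def add_node(self, id: int, time: Optional[float] = None) -> None:
        self.nodes[id] = Node(id, time)

    def add_arc(self, src: int, dst: int, type: str, label: str) -> None:
        # Arcs may only connect nodes that already exist in the graph.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be existing nodes")
        self.arcs.append(Arc(src, dst, type, label))

    def search(self, type: Optional[str] = None,
               label: Optional[str] = None) -> List[Arc]:
        # Return arcs matching every criterion that was supplied.
        return [a for a in self.arcs
                if (type is None or a.type == type)
                and (label is None or a.label == label)]

    def span(self, arc: Arc) -> Tuple[Optional[float], Optional[float]]:
        # Time extent of an arc, read off its two endpoint nodes.
        return (self.nodes[arc.src].time, self.nodes[arc.dst].time)


# Two tiers over the same stretch of signal: one word and its two phones.
g = AnnotationGraph()
g.add_node(0, 0.0)
g.add_node(1)          # internal boundary with no known time
g.add_node(2, 0.30)
g.add_arc(0, 2, "word", "she")
g.add_arc(0, 1, "phone", "sh")
g.add_arc(1, 2, "phone", "iy")

phones = [a.label for a in g.search(type="phone")]
word_span = g.span(g.search(type="word")[0])
```

In this sketch the word tier and the phone tier share nodes 0 and 2, so their alignment is expressed by the graph structure itself rather than by a particular file format.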
Abstract
'Linguistic annotation' covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions -- audio, video and/or physiological recordings -- or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, 'named entity' identification, co-reference annotation, and so on. While there are several ongoing efforts to provide formats and tools for such annotations and to publish annotated linguistic databases, the lack of widely accepted standards is becoming a critical problem. Proposed standards, to the extent they exist, have focussed on file formats. This paper focuses instead on the logical structure of linguistic annotations. We survey a wide variety of existing annotation formats and demonstrate a common…
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Speech and dialogue systems
