Mining Measured Information from Text

Arun S. Maiya; Dale Visser; Andrew Wan

arXiv:1505.01072·cs.CL·May 6, 2015

Open Access

TL;DR

This paper introduces a robust rule-based system for extracting measured quantities and their properties from text, supporting diverse units and formats, and demonstrates its application in a specialized search engine.

Contribution

The paper presents a rule-based method for extracting measured quantities and the properties they describe from text, tolerating errors introduced by document format conversion, and integrates the method into a search engine.

Findings

01 Supports a wide range of measurement units

02 Robust to format conversion errors

03 Enables specialized search for measured data

Abstract

We present an approach to extract measured information from text (e.g., a 1370 degrees C melting point, a BMI greater than 29.9 kg/m^2). Such extractions are critically important across a wide range of domains, especially those involving search and exploration of scientific and technical documents. We first propose a rule-based entity extractor to mine measured quantities (i.e., a numeric value paired with a measurement unit), which supports a vast and comprehensive set of both common and obscure measurement units. Our method is highly robust and can correctly recover valid measured quantities even when significant errors are introduced through the process of converting document formats like PDF to plain text. Next, we describe an approach to extracting the properties being measured (e.g., the property "pixel pitch" in the phrase "a pixel pitch as high as 352 μm"). Finally, we…
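
To make the idea of a "measured quantity" (a numeric value paired with a measurement unit) concrete, here is a minimal Python sketch of a rule-based matcher. It is not the authors' extractor: the `UNITS` list, the `QUANTITY_RE` pattern, and the `extract_quantities` function are hypothetical names chosen for illustration, the unit list is tiny compared with the set the abstract describes, and the sketch makes no attempt to recover from format conversion errors.

```python
import re

# Hypothetical, deliberately tiny unit list (assumption for illustration only).
UNITS = ["kg/m^2", "degrees C", "μm", "um", "mm", "cm", "kg", "m"]

# Try longer units first so that "kg/m^2" is preferred over "kg".
_unit_alternatives = "|".join(
    re.escape(u) for u in sorted(UNITS, key=len, reverse=True)
)

QUANTITY_RE = re.compile(
    r"(?<![\w.])"                        # do not start inside a word or number
    r"(?P<value>[-+]?\d+(?:\.\d+)?)"     # numeric value, optional sign and decimals
    r"\s*"                               # optional whitespace before the unit
    r"(?P<unit>" + _unit_alternatives + r")"
    r"(?![A-Za-z])"                      # unit must not run into another letter
)


def extract_quantities(text):
    """Return (value, unit) pairs for every value-unit match in text."""
    return [
        (float(m.group("value")), m.group("unit"))
        for m in QUANTITY_RE.finditer(text)
    ]


if __name__ == "__main__":
    for phrase in (
        "a 1370 degrees C melting point",
        "a BMI greater than 29.9 kg/m^2",
        "a pixel pitch as high as 352 μm",
    ):
        print(phrase, "->", extract_quantities(phrase))
```

The trailing lookahead is a simple design choice to keep a short unit such as "m" from matching the start of an ordinary word; a fuller system would need a much larger unit inventory and handling for the conversion errors the abstract mentions.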

Taxonomy

Topics: Semantic Web and Ontologies · Handwritten Text Recognition Techniques · Mathematics, Computing, and Information Processing