Mining Measured Information from Text
Arun S. Maiya, Dale Visser, Andrew Wan

TL;DR
This paper introduces a robust rule-based system for extracting measured quantities and their properties from text, supporting diverse units and formats, and demonstrates its application in a specialized search engine.
Contribution
It presents a novel rule-based method for extracting measured quantities and the properties they measure from text, tolerates errors introduced by document format conversion (e.g., PDF to plain text), and integrates the extractions into a search engine.
Findings
Supports a wide range of measurement units
Robust to errors introduced when converting document formats (e.g., PDF to plain text)
Enables specialized search for measured data
Abstract
We present an approach to extract measured information from text (e.g., a 1370 degrees C melting point, a BMI greater than 29.9 kg/m^2). Such extractions are critically important across a wide range of domains - especially those involving search and exploration of scientific and technical documents. We first propose a rule-based entity extractor to mine measured quantities (i.e., a numeric value paired with a measurement unit), which supports a vast and comprehensive set of both common and obscure measurement units. Our method is highly robust and can correctly recover valid measured quantities even when significant errors are introduced through the process of converting document formats like PDF to plain text. Next, we describe an approach to extracting the properties being measured (e.g., the property "pixel pitch" in the phrase "a pixel pitch as high as 352 μm"). Finally, we…
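To make the idea of a "measured quantity" (a numeric value paired with a measurement unit) concrete, here is a minimal Python sketch of a rule-based extractor. It is an illustration under stated assumptions, not the authors' system: the `UNITS` list, the `QUANTITY` pattern, and the `extract_quantities` function are hypothetical names invented for this example, and the tiny unit table stands in for the much larger unit inventory the abstract describes. It does not attempt the error recovery or property extraction mentioned above.

```python
import re

# Hypothetical, deliberately tiny unit table (an assumption for illustration);
# the system described in the abstract supports far more units.
UNITS = ["degrees C", "kg/m^2", "μm", "mm", "kg"]

# Try longer units first so that "kg/m^2" is preferred over the prefix "kg".
_unit_alternatives = "|".join(
    re.escape(u) for u in sorted(UNITS, key=len, reverse=True)
)

# A number (optionally with a decimal part), optional whitespace, then a unit.
# The lookbehind avoids starting inside a longer token; the lookahead avoids
# matching a unit that is merely the start of a longer word.
QUANTITY = re.compile(
    r"(?<![\w.])(\d+(?:\.\d+)?)\s*(" + _unit_alternatives + r")(?![A-Za-z])"
)


def extract_quantities(text):
    """Return a list of (value, unit) pairs found in text."""
    return [(float(m.group(1)), m.group(2)) for m in QUANTITY.finditer(text)]
```

Applied to the abstract's own examples, `extract_quantities("a BMI greater than 29.9 kg/m^2")` yields the pair `(29.9, "kg/m^2")`. A sketch like this fails as soon as the text is damaged (for example, a unit split across lines by PDF conversion), which is the kind of case the paper's extractor is said to handle.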
Taxonomy
Topics: Semantic Web and Ontologies · Handwritten Text Recognition Techniques · Mathematics, Computing, and Information Processing
