Identifying the Units of Measurement in Tabular Data
Taha Ceritli, Christopher K. I. Williams

TL;DR
This paper introduces PUC, a probabilistic method for accurately identifying units of measurement in tabular data columns, extracting semantic descriptions, and canonicalizing entries, with improved performance over existing solutions.
Contribution
The paper presents PUC, the first probabilistic approach for unit identification in messy real-world tabular data, along with annotated datasets for further research.
Findings
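PUC identifies units of measurement more accurately than existing solutions on the annotated datasets.
The paper provides the first messy real-world tabular datasets annotated for units of measurement.
PUC extracts semantic descriptions of quantitative data columns and canonicalizes their entries.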
Abstract
We consider the problem of identifying the units of measurement in a data column that contains both numeric values and unit symbols in each row, e.g., "5.2 l", "7 pints". In this case we seek to identify the dimension of the column (e.g., volume) and relate the unit symbols to valid units (e.g., litre, pint) obtained from a knowledge graph. Below we present PUC, a Probabilistic Unit Canonicalizer that can accurately identify the units of measurement, extract semantic descriptions of quantitative data columns, and canonicalize their entries. We present the first messy real-world tabular datasets annotated for units of measurement, which can enable and accelerate research in this area. Our experiments on these datasets show that PUC achieves better results than existing solutions.
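To make the task concrete, the following is a minimal sketch of the problem setting only, not of PUC itself. It replaces the probabilistic model with a plain dictionary lookup and a majority vote, and the `UNITS` table, `parse_entry`, and `identify_column` are hypothetical names introduced here for illustration; a real system would obtain its units from a knowledge graph, as the abstract describes.

```python
import re
from collections import Counter

# Hypothetical miniature unit table: symbol -> (canonical unit, dimension).
# Assumption: stands in for the knowledge graph mentioned in the abstract.
UNITS = {
    "l": ("litre", "volume"),
    "litre": ("litre", "volume"),
    "pint": ("pint", "volume"),
    "pints": ("pint", "volume"),
    "kg": ("kilogram", "mass"),
    "g": ("gram", "mass"),
}

# A number followed by an alphabetic unit symbol, e.g. "5.2 l" or "7 pints".
ENTRY = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def parse_entry(text):
    """Split an entry into (value, lowercased symbol), or None if it does not match."""
    match = ENTRY.match(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2).lower()


def identify_column(entries):
    """Pick the column dimension by majority vote, then canonicalize each entry."""
    parsed = [parse_entry(entry) for entry in entries]

    # Each entry with a known symbol votes for that symbol's dimension.
    votes = Counter(UNITS[p[1]][1] for p in parsed if p and p[1] in UNITS)
    dimension = votes.most_common(1)[0][0] if votes else None

    canonical = []
    for p in parsed:
        if p and p[1] in UNITS and UNITS[p[1]][1] == dimension:
            canonical.append((p[0], UNITS[p[1]][0]))
        else:
            # Unparseable, unknown, or inconsistent with the column dimension.
            canonical.append(None)
    return dimension, canonical


dimension, canonical = identify_column(["5.2 l", "7 pints", "n/a"])
print(dimension, canonical)
```

A hard lookup like this fails on exactly the cases that motivate a probabilistic treatment: a symbol that could denote more than one unit, a misspelled symbol, or a column that mixes noise with valid entries.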
Taxonomy
Topics: Time Series Analysis and Forecasting · Data Visualization and Analytics · Rough Sets and Fuzzy Logic
