Identifying the Units of Measurement in Tabular Data

Taha Ceritli; Christopher K. I. Williams

arXiv:2111.11959·cs.LG·November 26, 2021

TL;DR

This paper introduces PUC, a probabilistic method for accurately identifying units of measurement in tabular data columns, extracting semantic descriptions, and canonicalizing entries, with improved performance over existing solutions.

Contribution

The paper presents PUC, the first probabilistic approach for unit identification in messy real-world tabular data, along with annotated datasets for further research.

Findings

01

PUC outperforms existing solutions in unit identification accuracy.

02

The paper presents the first messy real-world tabular datasets annotated for units of measurement.

03

PUC extracts semantic descriptions of quantitative data columns and canonicalizes their entries.

Abstract

We consider the problem of identifying the units of measurement in a data column that contains both numeric values and unit symbols in each row, e.g., "5.2 l", "7 pints". In this case we seek to identify the dimension of the column (e.g. volume) and relate the unit symbols to valid units (e.g. litre, pint) obtained from a knowledge graph. Below we present PUC, a Probabilistic Unit Canonicalizer that can accurately identify the units of measurement, extract semantic descriptions of quantitative data columns and canonicalize their entries. We present the first messy real-world tabular datasets annotated for units of measurement, which can enable and accelerate the research in this area. Our experiments on these datasets show that PUC achieves better results than existing solutions.
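To make the task concrete, the sketch below parses cells such as "5.2 l" and "7 pints", picks a column dimension, and maps each symbol to a unit. It is a toy illustration only, not PUC: the `CANDIDATES` table is a hypothetical stand-in for the knowledge graph, every function name is invented for this example, and the dimension is chosen by a simple vote over cells rather than by a probabilistic model.

```python
import re
from collections import Counter

# Hypothetical stand-in for a knowledge graph:
# unit symbol -> candidate (unit name, dimension) pairs.
CANDIDATES = {
    "l": [("litre", "volume")],
    "litres": [("litre", "volume")],
    "pint": [("pint", "volume")],
    "pints": [("pint", "volume")],
    "m": [("metre", "length"), ("minute", "time")],  # ambiguous symbol
    "km": [("kilometre", "length")],
}

# A number followed by an alphabetic unit symbol, e.g. "5.2 l".
CELL = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def parse_cell(cell):
    """Split a cell into (value, symbol); return None if it does not match."""
    match = CELL.match(cell)
    if match is None:
        return None
    return float(match.group(1)), match.group(2).lower()


def infer_dimension(column):
    """Return the dimension supported by the most cells (a plain vote)."""
    votes = Counter()
    for cell in column:
        parsed = parse_cell(cell)
        if parsed is None:
            continue
        _, symbol = parsed
        # Each cell votes once for every dimension its symbol could denote.
        for dimension in {d for _, d in CANDIDATES.get(symbol, [])}:
            votes[dimension] += 1
    return votes.most_common(1)[0][0] if votes else None


def canonicalize(column):
    """Return (dimension, [(value, unit name), ...]) for a column of strings."""
    dimension = infer_dimension(column)
    entries = []
    for cell in column:
        parsed = parse_cell(cell)
        value, unit = None, None
        if parsed is not None:
            value, symbol = parsed
            # Keep only the candidate unit that agrees with the column dimension.
            for name, dim in CANDIDATES.get(symbol, []):
                if dim == dimension:
                    unit = name
                    break
        entries.append((value, unit))
    return dimension, entries
```

For example, `canonicalize(["3 m", "2 km", "n/a"])` resolves the ambiguous symbol "m" to metre because the rest of the column supports length, and leaves the unparseable cell as `(None, None)`. This is the kind of column-level context that makes the problem harder than a per-cell lookup.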

Code & Models

Repositories

tahaceritli/puc
Official

Taxonomy

Topics: Time Series Analysis and Forecasting · Data Visualization and Analytics · Rough Sets and Fuzzy Logic