Identifying Excessively Rounded or Truncated Data

Kevin H. Knuth; J. Patrick Castle; and Kevin R. Wheeler

arXiv:1602.04292·physics.data-an·February 16, 2016

Open Access

TL;DR

This paper presents a simple method, based on optimal histogram binning, for detecting when digitization effects (rounding or truncation) are large enough to have caused information loss, so that the problem can be caught before the data analysis stage.

Contribution

It introduces a straightforward technique that uses an optimal histogram binning algorithm to identify excessive rounding or truncation in digitized data.

Findings

01. Detects digitization artifacts in data sets

02. Identifies when the scale of the digitization is comparable to the structure of the data

03. Flags irreversible information loss before the data analysis stage

Abstract

All data are digitized, and hence are essentially integers rather than true real numbers. Ordinarily this causes no difficulties since the truncation or rounding usually occurs below the noise level. However, in some instances, when the instruments or data delivery and storage systems are designed with less than optimal regard for the data or the subsequent data analysis, the effects of digitization may be comparable to important features contained within the data. In these cases, information has been irrevocably lost in the truncation process. While there exist techniques for dealing with truncated data, we propose a straightforward method that will allow us to detect this problem before the data analysis stage. It is based on an optimal histogram binning algorithm that can identify when the statistical structure of the digitization is on the order of the statistical structure of the…
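The idea in the abstract can be illustrated with a minimal sketch. It assumes the criterion is the relative log posterior over the number of equal-width bins from Knuth's optimal binning work; the page itself does not state the formula. The function names, sample size, rounding step, and search limit below are illustrative choices, not taken from the paper. For finely resolved data the preferred bin count stays small, while for heavily rounded data it climbs toward the search limit, because ever narrower bins keep isolating the discrete values.

```python
import numpy as np
from math import lgamma

def log_posterior_bins(data, m):
    """Relative log posterior for m equal-width bins.

    Assumed form (Knuth-style optimal binning with a Dirichlet(1/2) prior);
    the value is only meaningful up to an additive constant.
    """
    n = len(data)
    counts, _ = np.histogram(data, bins=m)
    return (n * np.log(m)
            + lgamma(m / 2)
            - m * lgamma(0.5)
            - lgamma(n + m / 2)
            + sum(lgamma(c + 0.5) for c in counts))

def optimal_bins(data, max_bins):
    """Return the bin count in 1..max_bins with the highest score."""
    scores = [log_posterior_bins(data, m) for m in range(1, max_bins + 1)]
    return int(np.argmax(scores)) + 1

# Hypothetical test data: Gaussian samples, and the same samples
# rounded to one decimal place to mimic excessive rounding.
rng = np.random.default_rng(0)
fine = rng.normal(0.0, 1.0, 1000)
coarse = np.round(fine, 1)

max_bins = 200  # illustrative search limit
m_fine = optimal_bins(fine, max_bins)
m_coarse = optimal_bins(coarse, max_bins)

print("preferred bins, fine data:   ", m_fine)
print("preferred bins, rounded data:", m_coarse)
```

A preferred bin count that sits at or near whatever search limit is chosen is the warning sign here; raising `max_bins` and seeing the optimum follow it would suggest that the digitization, rather than the underlying density, is what the histogram is resolving.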

Taxonomy

Topics: Time Series Analysis and Forecasting · Neural Networks and Applications · Computational Physics and Python Applications