Representing numeric data in 32 bits while preserving 64-bit precision
Radford M. Neal

TL;DR
This paper presents a method for compactly representing certain 64-bit floating-point numbers in 32 bits, with the dropped low-order bits recovered by table lookup. Decoding is faster than decimal floating-point conversion, and the scheme may allow automatic compression of large data arrays.
Contribution
It introduces a scheme for representing specific subsets of 64-bit floating-point values in 32 bits, decoded exactly by table lookup, which avoids the slow division that decimal floating-point conversion requires.
Findings
Representation is exact for numbers with up to 6 decimal digits.
Decoding is faster than decimal floating-point conversion.
Suitable for compressing large data arrays in interpretive languages.
Abstract
Data files often consist of numbers having only a few significant decimal digits, whose information content would allow storage in only 32 bits. However, we may require that arithmetic operations involving these numbers be done with 64-bit floating-point precision, which precludes simply representing the data as 32-bit floating-point values. Decimal floating point gives a compact and exact representation, but requires conversion with a slow division operation before it can be used. Here, I show that interesting subsets of 64-bit floating-point values can be compactly and exactly represented by the 32 bits consisting of the sign, exponent, and high-order part of the mantissa, with the lower-order 32 bits of the mantissa filled in by table lookup, indexed by bits from the part of the mantissa retained, and possibly from the exponent. For example, decimal data with 4 or fewer digits to the…
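The split-and-lookup idea described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the subset is assumed to be non-negative values n/100 for integer n from 0 to 999999, the function names (`split_double`, `build_table`, `encode`, `decode`) are hypothetical, and a dictionary keyed by the whole retained 32-bit word stands in for the compact table indexed by selected mantissa and exponent bits.

```python
import struct

def split_double(x):
    """Return (high, low): the upper and lower 32 bits of x as an IEEE 754 double."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return bits >> 32, bits & 0xFFFFFFFF

def join_double(high, low):
    """Reassemble a double from its upper and lower 32-bit halves."""
    (x,) = struct.unpack("<d", struct.pack("<Q", (high << 32) | low))
    return x

def build_table(max_n=999999, scale=100):
    """Map each retained high word to the low word it implies.

    Assumption: the data are the values n/scale for 0 <= n <= max_n.
    A dict keyed by the full high word is a simplification of a
    compact array indexed by a few selected bits.
    """
    table = {}
    for n in range(max_n + 1):
        high, low = split_double(n / scale)
        if table.setdefault(high, low) != low:
            raise ValueError("two values share a high word; subset not representable")
    return table

def encode(x, table):
    """Return the 32-bit code for x, or raise if x is outside the subset."""
    high, low = split_double(x)
    if table.get(high) != low:
        raise ValueError(f"{x!r} is not in the representable subset")
    return high

def decode(code, table):
    """Recover the exact 64-bit value by filling in the low 32 bits."""
    return join_double(code, table[code])

table = build_table()
code = encode(1234.56, table)
assert code < 2**32                      # the code fits in 32 bits
assert decode(code, table) == 1234.56    # decoding is bit-for-bit exact
```

Decoding here is one lookup plus a shift and an OR, with no division. The sketch handles only the non-negative values it was built for; negative values and other subsets would need their own table entries.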
Taxonomy
Topics: Numerical Methods and Algorithms · Cryptography and Residue Arithmetic · Digital Filter Design and Implementation
