AstroCompress: A benchmark dataset for multi-purpose compression of astronomical data
Tuan Truong, Rithwik Sudharsan, Yibo Yang, Peter Xiangyuan Ma, Ruihan Yang, Stephan Mandt, Joshua S. Bloom

TL;DR
AstroCompress introduces a new benchmark dataset for evaluating neural and classical lossless compression methods on diverse astronomical imaging data, aiming to improve data transmission efficiency in observatories.
Contribution
This paper presents AstroCompress, a comprehensive benchmark with datasets and evaluation tools for neural lossless compression of astronomical data, highlighting potential improvements over traditional methods.
Findings
Neural compression methods outperform classical algorithms on the benchmark.
Lossless neural techniques can significantly enhance data transmission efficiency.
The benchmark facilitates future research in astrophysical data compression.
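As a rough illustration of how a lossless compression ratio and runtime might be measured for classical codecs, here is a minimal sketch using only the Python standard library. It is not the paper's benchmark code. The `synthetic_frame` helper and its sky-level and noise parameters are hypothetical stand-ins for a real 16-bit detector image, and `zlib`, `bz2`, and `lzma` are chosen only because they ship with Python, not because the paper evaluates them.

```python
import bz2
import lzma
import random
import time
import zlib
from array import array


def synthetic_frame(width=256, height=256, seed=0):
    """Hypothetical 16-bit frame: a flat sky level plus Gaussian noise."""
    rng = random.Random(seed)
    pixels = array("H")  # unsigned 16-bit integers on common platforms
    for _ in range(width * height):
        value = int(rng.gauss(1000, 20))
        pixels.append(min(max(value, 0), 65535))  # clamp to the 16-bit range
    return pixels.tobytes()


# General-purpose lossless codecs from the standard library.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def benchmark(raw):
    """Return compression ratio and encode time for each codec."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(raw)
        elapsed = time.perf_counter() - start
        # Lossless means the round trip reproduces the input exactly.
        assert decompress(packed) == raw
        results[name] = {
            "ratio": len(raw) / len(packed),  # higher is better
            "seconds": elapsed,
        }
    return results


if __name__ == "__main__":
    for name, stats in benchmark(synthetic_frame()).items():
        print(f"{name}: ratio={stats['ratio']:.2f}, time={stats['seconds']:.4f}s")
```

The same loop could be pointed at real image bytes in place of the synthetic frame. The numbers it prints describe only this toy input and say nothing about the benchmark results reported in the paper.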
Abstract
The site conditions that make astronomical observatories in space and on the ground so desirable -- cold and dark -- demand a physical remoteness that leads to limited data transmission capabilities. Such transmission limitations directly bottleneck the amount of data acquired, and in an era of costly modern observatories, any improvement in lossless data compression has the potential to scale to billions of dollars' worth of additional science that can be accomplished on the same instrument. Traditional lossless methods for compressing astrophysical data are manually designed. Neural data compression, on the other hand, holds the promise of learning compression algorithms end-to-end from data and outperforming classical techniques by leveraging the unique spatial, temporal, and wavelength structures of astronomical images. This paper introduces AstroCompress: a neural compression challenge…
Peer Reviews
Decision: ICLR 2025 Poster
1. The released dataset is openly accessible and well organized for researchers to follow. 2. Several lossless compression codecs have been tested, including both learned and traditional methods. Evaluations of the impact on datasets and methods are presented.
1. Since many of the datasets are from publicly accessible resources, I suggest that the authors contribute more analysis and deeper evaluation of the re-organized AstroCompress dataset. 2. The cross-dataset evaluation can be improved. In this paper, only brief conclusions about the generalization of different codecs are provided. A deeper evaluation of the similarity between different datasets, and of the reasons different codecs perform differently, is suggested.
The dataset is composed of diverse images, from 2D to 4D, which is impressive. The selection of neural lossless codec baselines is smart. In fact, IDF, L3C and PixelCNN++ represent three major paradigms of lossless compression: normalizing flows, latent variable models and autoregressive models. I cannot think of a better choice if I needed to choose the three most representative lossless image codecs.
The most evident weakness is that the authors evaluate 7 general lossless codecs and only 1 astronomical codec (published in 2009). I have two explanations for this evaluation: * The authors' evaluation is not thorough, and many astronomical codecs are omitted. * Astronomical data compression is not an active research area, and the only reasonable baseline was published 15 years ago. Either explanation makes me hesitate about accepting this paper. Besides, I am not really sure about how
The paper makes a strong contribution to astronomical and neural compression research. The paper lays out clearly how astronomical experiments depend on good compression and why a dataset specific to astronomical data is necessary. The scope of the dataset is also impressive. It is likely that a researcher working on neural compression, lossless or otherwise, would be interested in findings from testing their method on this dataset. Likewise I can imagine astronomical researchers using results o
The only thing I would have liked to see more discussion of is the point about runtime versus compression ratio. Is there any actionable recommendation here? Also, given that JPEG-XL is mostly a reference implementation right now and hasn't been optimized, while the neural methods require a GPU, is it a fair comparison to say that it is slower?
Taxonomy
Topics: Advanced Data Compression Techniques · Advanced Data Storage Technologies · Algorithms and Data Compression
