Compressed-Language Models for Understanding Compressed File Formats: a JPEG Exploration
Juan C. Pérez, Alejandro Pardo, Mattia Soldan, Hani Itani, Juan Leon-Alcazar, Bernard Ghanem

TL;DR
This paper explores whether Compressed-Language Models (CLMs) can understand JPEG files directly from their raw byte streams. It probes their ability to recognize file properties, handle files with anomalies, and generate new files, and reports that they perform these tasks effectively, which suggests they capture the semantics of compressed data.
Contribution
The paper applies CLMs to JPEG files and shows that they can interpret and manipulate compressed data directly from raw byte streams.
Findings
CLMs can recognize inherent JPEG file properties.
CLMs can handle anomalies in JPEG files.
CLMs can generate new JPEG files.
Abstract
This study investigates whether Compressed-Language Models (CLMs), i.e., language models operating on raw byte streams from Compressed File Formats (CFFs), can understand files compressed by CFFs. We focus on the JPEG format as a representative CFF, given its commonality and its representativeness of key concepts in compression, such as entropy coding and run-length encoding. We test whether CLMs understand the JPEG format by probing their capabilities along three axes: recognition of inherent file properties, handling of files with anomalies, and generation of new files. Our findings demonstrate that CLMs can effectively perform these tasks. These results suggest that CLMs can understand the semantics of compressed data when operating directly on the byte streams of files produced by CFFs. The possibility of operating directly on raw compressed files offers the promise to leverage…
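To make "operating on raw byte streams" concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a byte-level vocabulary in which each byte maps to one token ID in [0, 255], and it illustrates one possible kind of anomaly (a single-byte substitution). The function names and the toy byte stream are hypothetical; the only JPEG facts used are the standard Start Of Image (FF D8) and End Of Image (FF D9) markers.

```python
SOI = b"\xff\xd8"  # JPEG Start Of Image marker
EOI = b"\xff\xd9"  # JPEG End Of Image marker


def bytes_to_tokens(data: bytes) -> list[int]:
    """Map each raw byte to an integer token ID in [0, 255] (assumed vocabulary)."""
    return list(data)


def has_jpeg_framing(data: bytes) -> bool:
    """Check only that the stream starts with SOI and ends with EOI."""
    return len(data) >= 4 and data.startswith(SOI) and data.endswith(EOI)


def substitute_byte(data: bytes, index: int, value: int) -> bytes:
    """Return a copy of the stream with one byte replaced (a hypothetical anomaly)."""
    buf = bytearray(data)
    buf[index] = value
    return bytes(buf)


# Toy stream: framed like a JPEG but NOT a decodable image.
toy = SOI + bytes([0x12, 0x34, 0x56]) + EOI
tokens = bytes_to_tokens(toy)
corrupted = substitute_byte(toy, 3, 0x00)

print(tokens)
print(has_jpeg_framing(toy), corrupted.hex())
```

With a real file, `data` would come from something like `open(path, "rb").read()`, and the resulting token sequence is what a byte-level language model would consume. The framing check is deliberately shallow: it says nothing about whether the entropy-coded payload is valid.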
Taxonomy
Topics: Advanced Data Storage Technologies
Methods: Focus
