Compressed-Language Models for Understanding Compressed File Formats: a JPEG Exploration

Juan C. Pérez; Alejandro Pardo; Mattia Soldan; Hani Itani; Juan Leon-Alcazar; Bernard Ghanem

arXiv:2405.17146·cs.CV·May 28, 2024·1 cites

Open Access

TL;DR

This paper explores whether Compressed-Language Models (CLMs) can understand JPEG files directly from their raw byte streams. It shows that CLMs can recognize file properties, handle anomalies, and generate new files, which suggests that they capture the semantics of compressed data.

Contribution

The paper applies CLMs to JPEG files and shows that they can interpret and manipulate compressed data directly from raw byte streams.

Findings

01 CLMs can recognize inherent JPEG file properties.

02 CLMs can handle anomalies in JPEG files.

03 CLMs can generate new JPEG files.

Abstract

This study investigates whether Compressed-Language Models (CLMs), i.e., language models operating on raw byte streams from Compressed File Formats (CFFs), can understand files compressed by CFFs. We focus on the JPEG format as a representative CFF, given its commonality and its representativeness of key concepts in compression, such as entropy coding and run-length encoding. We test if CLMs understand the JPEG format by probing their capabilities to perform along three axes: recognition of inherent file properties, handling of files with anomalies, and generation of new files. Our findings demonstrate that CLMs can effectively perform these tasks. These results suggest that CLMs can understand the semantics of compressed data when directly operating on the byte streams of files produced by CFFs. The possibility to directly operate on raw compressed files offers the promise to leverage…
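To make "operating on raw byte streams" concrete, the following is a minimal sketch that assumes one token per byte (a vocabulary of 256 values). The abstract does not specify the tokenization the authors use, so the helper names `bytes_to_tokens`, `tokens_to_bytes`, `looks_like_jpeg`, and `replace_token` are hypothetical and serve only as illustration. The sketch also shows how a single-byte change might produce the kind of file anomaly the abstract mentions.

```python
# JPEG files begin with the start-of-image marker and end with the
# end-of-image marker; these two values come from the JPEG format itself.
SOI = b"\xff\xd8"
EOI = b"\xff\xd9"


def bytes_to_tokens(data: bytes) -> list:
    """Map each byte to an integer token in [0, 255] (assumed vocabulary)."""
    return list(data)


def tokens_to_bytes(tokens: list) -> bytes:
    """Inverse mapping: turn a token sequence back into a byte stream."""
    return bytes(tokens)


def looks_like_jpeg(data: bytes) -> bool:
    """Cheap structural check: only verifies the SOI and EOI markers."""
    return data.startswith(SOI) and data.endswith(EOI)


def replace_token(tokens: list, position: int, new_value: int) -> list:
    """Return a copy with one token replaced (a hypothetical anomaly)."""
    perturbed = list(tokens)
    perturbed[position] = new_value
    return perturbed


# Toy stand-in for a file: valid markers around arbitrary payload bytes.
# It is NOT a decodable image; it only illustrates the token view.
toy_stream = SOI + bytes([0x12, 0x34, 0x56]) + EOI
tokens = bytes_to_tokens(toy_stream)

# Overwrite the second byte of the SOI marker to simulate a broken file.
perturbed = replace_token(tokens, 1, 0x00)
```

In this view a file is just a sequence of small integers, so a standard sequence model could in principle consume it without a decoder in the loop. A real stream would come from reading a `.jpg` file in binary mode rather than from the toy bytes used here.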

Taxonomy

Topics: Advanced Data Storage Technologies

Methods: Focus