Optimizing XML Compression

Gregory Leighton; Denilson Barbosa

arXiv:0905.4761·cs.DB·May 13, 2015

TL;DR

This paper investigates XML compression, demonstrating that optimizing compression configurations is NP-hard, and proposes an approximation algorithm for partitioning document content to improve compression gain.

Contribution

It formally proves the NP-hardness of optimizing XML compression configurations and introduces a branch-and-bound based approximation algorithm for content partitioning.

Findings

01. The optimal compression configuration problem is NP-hard.

02. The proposed approximation algorithm improves compression partitioning.

03. The analysis guides better XML compression strategies.

Abstract

The eXtensible Markup Language (XML) provides a powerful and flexible means of encoding and exchanging data. As it turns out, its main advantage as an encoding format (namely, its requirement that all open and close markup tags be present and properly balanced) also yields one of its main disadvantages: verbosity. XML-conscious compression techniques seek to overcome this drawback. Many of these techniques first separate XML structure from the document content, and then compress each independently. Further compression gains can be realized by identifying and compressing together document content that is highly similar, thereby amortizing the storage costs of auxiliary information required by the chosen compression algorithm. Additionally, the proper choice of compression algorithm is an important factor not only for the achievable compression gain, but also for access performance.…
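
The amortization idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: grouping text by element tag, the sample document `DOC`, and the helper names `containers_by_tag` and `compressed_size` are all assumptions made for illustration, and `zlib` stands in for whatever compression algorithm a real system would choose. The sketch separates content from structure, then compares compressing every value on its own against compressing each group of similar values together.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document, used only for illustration.
DOC = """<people>
  <person><name>Alice Smith</name><city>Edmonton</city></person>
  <person><name>Bob Smith</name><city>Edmonton</city></person>
  <person><name>Carol Smith</name><city>Calgary</city></person>
  <person><name>Dave Smith</name><city>Edmonton</city></person>
</people>"""


def containers_by_tag(xml_text):
    """Group text content by element tag.

    Assumption: the tag name is a crude proxy for "highly similar" content.
    """
    groups = defaultdict(list)
    for elem in ET.fromstring(xml_text).iter():
        text = (elem.text or "").strip()
        if text:  # skip elements that hold only whitespace or children
            groups[elem.tag].append(text)
    return dict(groups)


def compressed_size(values):
    """Size in bytes of the values joined by newlines and compressed as one stream."""
    return len(zlib.compress("\n".join(values).encode("utf-8")))


containers = containers_by_tag(DOC)

# One compressed stream per value: every stream pays its own header and checksum.
per_value = sum(compressed_size([v]) for vs in containers.values() for v in vs)

# One compressed stream per container: the fixed overhead is shared by the group.
per_container = sum(compressed_size(vs) for vs in containers.values())

print("per value:    ", per_value, "bytes")
print("per container:", per_container, "bytes")
```

On this toy input the per-container total is smaller because the fixed per-stream overhead is paid once per group rather than once per value. Which groups to form, and which algorithm to apply to each, is the configuration choice the paper shows to be NP-hard to optimize.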
