
TL;DR
This paper investigates XML compression, demonstrating that optimizing compression configurations is NP-hard, and proposes an approximation algorithm for partitioning document content to improve compression gain.
Contribution
It formally proves the NP-hardness of optimizing XML compression configurations and introduces a branch-and-bound based approximation algorithm for content partitioning.
Findings
Finding an optimal compression configuration is NP-hard.
The proposed approximation algorithm improves how document content is partitioned for compression.
The analysis informs the choice of XML compression strategies.
Abstract
The eXtensible Markup Language (XML) provides a powerful and flexible means of encoding and exchanging data. As it turns out, its main advantage as an encoding format (namely, its requirement that all open and close markup tags are present and properly balanced) also yields one of its main disadvantages: verbosity. XML-conscious compression techniques seek to overcome this drawback. Many of these techniques first separate XML structure from the document content, and then compress each independently. Further compression gains can be realized by identifying and compressing together document content that is highly similar, thereby amortizing the storage costs of auxiliary information required by the chosen compression algorithm. Additionally, the proper choice of compression algorithm is an important factor not only for the achievable compression gain, but also for access performance.…
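The separation and grouping idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the function names (`split_structure_and_content`, `total_cost`), the use of root-to-element tag paths as content containers, and the choice of zlib as the compressor are all assumptions made for illustration. The sketch splits a document into a structure stream and per-path content containers, then measures the compressed size of a given partition of those containers, so that compressing containers separately can be compared with compressing them together.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict


def split_structure_and_content(xml_text):
    """Separate an XML document into a structure token list and content containers.

    Containers are keyed by the root-to-element tag path (an assumption of this
    sketch). Text after a child's closing tag (elem.tail) and attributes are
    ignored for simplicity.
    """
    root = ET.fromstring(xml_text)
    structure = []
    containers = defaultdict(list)

    def walk(elem, path):
        here = path + "/" + elem.tag
        structure.append("<" + elem.tag)   # open-tag token
        text = (elem.text or "").strip()
        if text:
            containers[here].append(text)  # content goes to its path's container
        for child in elem:
            walk(child, here)
        structure.append(">")              # generic close token

    walk(root, "")
    return structure, dict(containers)


def compressed_size(values):
    """Size in bytes of the values joined by newlines and compressed with zlib."""
    return len(zlib.compress("\n".join(values).encode("utf-8"), 9))


def total_cost(containers, partition):
    """Sum of compressed sizes when each group of container paths is compressed together.

    `partition` is a list of groups; each group is a list of container paths.
    """
    total = 0
    for group in partition:
        values = [v for path in group for v in containers[path]]
        total += compressed_size(values)
    return total


if __name__ == "__main__":
    doc = (
        "<lib>"
        "<book><title>A</title><year>1999</year></book>"
        "<book><title>B</title><year>2001</year></book>"
        "</lib>"
    )
    structure, containers = split_structure_and_content(doc)
    separate = [[path] for path in containers]  # one compressed stream per container
    merged = [list(containers)]                 # all containers in one stream
    print("structure bytes:", compressed_size(structure))
    print("separate:", total_cost(containers, separate))
    print("merged:", total_cost(containers, merged))
```

Each compressed stream carries its own fixed overhead, which is one form of the auxiliary information that grouping amortizes. Whether merging actually helps depends on how similar the containers are, which is why the choice of partition is an optimization problem in its own right.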
