TADOC: Text Analytics Directly on Compression
Feng Zhang, Jidong Zhai, Xipeng Shen, Dalin Wang, Zheng Chen, Onur Mutlu, Wenguang Chen, Xiaoyong Du

TL;DR
TADOC performs text analytics directly on compressed data, without decompressing it, reducing storage space, memory usage, and processing time.
Contribution
The paper presents a hierarchical compression approach together with novel algorithms and data structure designs that enable analytics directly on compressed text, and it offers guidelines for addressing the challenges of realizing this effectively.
Findings
Reduces storage space by 90.8%
Decreases memory usage by 87.9%
Halves data processing times
Abstract
This article provides a comprehensive description of Text Analytics Directly on Compression (TADOC), which enables direct document analytics on compressed textual data. The article explains the concept of TADOC and the challenges to its effective realization. It then presents a series of guidelines and technical solutions that address those challenges, including the adoption of a hierarchical compression method and a set of novel algorithms and data structure designs. Experiments on six data analytics tasks of various complexities show that TADOC can save 90.8% storage space and 87.9% memory usage, while halving data processing times.
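To make the idea of analytics on hierarchically compressed text concrete, here is a minimal sketch of word counting over a grammar-style representation, in which repeated phrases are stored once as rules and referenced wherever they recur. The toy grammar, its encoding (integers as rule references, strings as words), and the names `GRAMMAR`, `expand`, and `word_count` are all illustrative assumptions; this is not the paper's data structure or algorithm, only one way such a computation could look.

```python
from collections import Counter

# Hypothetical toy grammar (an assumption for illustration).
# Rule 0 is the root. An int refers to another rule; a str is a word.
GRAMMAR = {
    0: [1, "c", 1, 2],
    1: ["a", 2],
    2: ["b", "a"],
}


def expand(grammar, rule=0):
    """Fully decompress a rule into its word sequence (used only as a check)."""
    words = []
    for symbol in grammar[rule]:
        if isinstance(symbol, int):
            words.extend(expand(grammar, symbol))
        else:
            words.append(symbol)
    return words


def word_count(grammar, root=0):
    """Count words by visiting each rule once, without expanding the text."""
    memo = {}

    def count(rule):
        if rule not in memo:
            total = Counter()
            for symbol in grammar[rule]:
                if isinstance(symbol, int):
                    # Reuse the cached counts of the referenced rule.
                    total.update(count(symbol))
                else:
                    total[symbol] += 1
            memo[rule] = total
        return memo[rule]

    return count(root)


if __name__ == "__main__":
    print(word_count(GRAMMAR))
    print(word_count(GRAMMAR) == Counter(expand(GRAMMAR)))
```

Because each rule's counts are computed once and reused at every reference, the work in this sketch scales with the size of the grammar rather than the length of the decompressed text, which is the intuition behind the savings the abstract reports.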
