TADOC: Text Analytics Directly on Compression

Feng Zhang; Jidong Zhai; Xipeng Shen; Dalin Wang; Zheng Chen; Onur Mutlu; Wenguang Chen; Xiaoyong Du

arXiv:2009.09442·cs.DS·September 22, 2020

TL;DR

TADOC introduces a method for performing text analytics directly on compressed data, significantly reducing storage, memory, and processing time without decompressing the data.

Contribution

The paper presents a hierarchical compression method, together with novel algorithms and data structure designs, that enables analytics to run directly on compressed text.

Findings

01. Reduces storage space by 90.8%
02. Decreases memory usage by 87.9%
03. Halves data processing times

Abstract

This article provides a comprehensive description of Text Analytics Directly on Compression (TADOC), which enables direct document analytics on compressed textual data. The article explains the concept of TADOC and the challenges to its effective realization. It also presents a series of guidelines and technical solutions that address those challenges, including a hierarchical compression method and a set of novel algorithms and data structure designs. Experiments on six data analytics tasks of various complexities show that TADOC can save 90.8% storage space and 87.9% memory usage, while halving data processing times.
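
To make the idea of analytics on compressed text concrete, here is a minimal sketch of a word count computed without decompression. It assumes the hierarchical compression can be modeled as a grammar in which each rule is a sequence of words and references to other rules. That encoding, the sample grammar `RULES`, and the helpers `word_counts` and `expand` are all hypothetical illustrations, not the paper's actual data structures or algorithms.

```python
from collections import Counter

# Hypothetical compressed form (an assumption for illustration):
# each rule is a list of symbols, where an int refers to another rule
# and a str is a literal word. Rule 0 is the root (the whole document).
RULES = {
    0: [1, "and", 1, 2],
    1: [2, "is", "fast"],
    2: ["text", "analytics"],
}


def word_counts(rules, root=0):
    """Count words by visiting each rule once and reusing its result."""
    memo = {}

    def counts_of(rule_id):
        if rule_id in memo:
            return memo[rule_id]
        total = Counter()
        for sym in rules[rule_id]:
            if isinstance(sym, int):
                # Reuse the sub-rule's counts instead of re-reading its text.
                total.update(counts_of(sym))
            else:
                total[sym] += 1
        memo[rule_id] = total
        return total

    return counts_of(root)


def expand(rules, root=0):
    """Fully decompress a rule into its word list (used only as a check)."""
    out = []
    for sym in rules[root]:
        if isinstance(sym, int):
            out.extend(expand(rules, sym))
        else:
            out.append(sym)
    return out


if __name__ == "__main__":
    print(word_counts(RULES))
    print(" ".join(expand(RULES)))
```

In this sketch each rule's counts are computed once and reused wherever the rule is referenced, so repeated content is not rescanned. This is one plausible way analytics on a compressed representation could do less work than analytics on the decompressed text.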
