Lexis: An Optimization Framework for Discovering the Hierarchical Structure of Sequential Data
Payam Siyari, Bistra Dilkina, Constantine Dovrolis

TL;DR
Lexis is a framework that constructs optimized hierarchical representations of string data, revealing underlying structures and facilitating applications like DNA synthesis, protein analysis, text compression, and document feature extraction.
Contribution
The paper introduces the Lexis framework for hierarchical string representation, proves NP-hardness of the optimization problem for two cost formulations, and offers an efficient greedy algorithm for practical construction.
Findings
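Constructs compact hierarchical string representations with a greedy algorithm; because the underlying problem is NP-hard, the result is optimized rather than guaranteed minimal.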
Applicable to diverse fields such as genomics, proteomics, and text analysis.
Provides insights into the core substrings within hierarchical structures.
Abstract
Data represented as strings abounds in biology, linguistics, document mining, web search and many other fields. Such data often have a hierarchical structure, either because they were artificially designed and composed in a hierarchical manner or because there is an underlying evolutionary process that creates repeatedly more complex strings from simpler substrings. We propose a framework, referred to as "Lexis", that produces an optimized hierarchical representation of a given set of "target" strings. The resulting hierarchy, "Lexis-DAG", shows how to construct each target through the concatenation of intermediate substrings, minimizing the total number of such concatenations or DAG edges. The Lexis optimization problem is related to the smallest grammar problem. After we prove its NP-Hardness for two cost formulations, we propose an efficient greedy algorithm for the construction of…
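To make the edge-minimization objective concrete, here is a minimal Python sketch of a greedy substitution heuristic in the spirit of smallest-grammar algorithms. It is not the authors' algorithm: the function names (`greedy_dag`, `edge_cost`, `expand`), the node labels (`T0`, `N0`, ...), and the saving formula are assumptions made for illustration. The sketch assumes the cost is the total number of DAG edges, so replacing a substring of length l that occurs k times (without overlap) by a new intermediate node removes k·l edges and adds k + l, for a net saving of (l − 1)(k − 1) − 1.

```python
from itertools import count


def count_nonoverlapping(seq, pat):
    """Count left-to-right non-overlapping occurrences of pat in seq."""
    m, i, k = len(pat), 0, 0
    while i <= len(seq) - m:
        if tuple(seq[i:i + m]) == pat:
            k += 1
            i += m
        else:
            i += 1
    return k


def replace_all(seq, pat, name):
    """Replace left-to-right non-overlapping occurrences of pat with name."""
    out, i, m = [], 0, len(pat)
    while i < len(seq):
        if tuple(seq[i:i + m]) == pat:
            out.append(name)
            i += m
        else:
            out.append(seq[i])
            i += 1
    return out


def edge_cost(nodes):
    """Total number of DAG edges: one per symbol on each node's right-hand side."""
    return sum(len(rhs) for rhs in nodes.values())


def greedy_dag(targets):
    """Illustrative greedy heuristic (an assumption, not the paper's algorithm).

    Each target i becomes node "Ti"; new intermediate nodes are named "N0", "N1", ...
    Single characters are the sources of the DAG.
    """
    nodes = {f"T{i}": list(t) for i, t in enumerate(targets)}
    fresh = count()
    while True:
        # Enumerate every substring of length >= 2 over the current nodes.
        candidates = set()
        for seq in nodes.values():
            for i in range(len(seq)):
                for j in range(i + 2, len(seq) + 1):
                    candidates.add(tuple(seq[i:j]))
        best, best_saving = None, 0
        for pat in sorted(candidates):  # sorted for deterministic tie-breaking
            k = sum(count_nonoverlapping(seq, pat) for seq in nodes.values())
            saving = (len(pat) - 1) * (k - 1) - 1  # edges removed minus edges added
            if saving > best_saving:
                best, best_saving = pat, saving
        if best is None:  # no substitution reduces the edge count
            return nodes
        name = f"N{next(fresh)}"
        for key in list(nodes):
            nodes[key] = replace_all(nodes[key], best, name)
        nodes[name] = list(best)


def expand(nodes, name):
    """Rebuild the string a node stands for by recursive concatenation."""
    return "".join(expand(nodes, s) if s in nodes else s for s in nodes[name])


if __name__ == "__main__":
    dag = greedy_dag(["abcabc", "abcd"])
    print(dag, edge_cost(dag))
```

For the targets `abcabc` and `abcd`, the flat representation needs 10 edges. The sketch extracts `abc` as a shared intermediate node and ends with 7 edges, and `expand` recovers each target exactly. The exhaustive substring enumeration is only practical for toy inputs.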
