Revisiting Weighted Information Extraction: A Simpler and Faster Algorithm for Ranked Enumeration
Pawel Gawrychowski, Florin Manea, Markus L. Schmid

TL;DR
This paper introduces a simpler, faster algorithm for weighted information extraction that improves delay bounds and leverages shortest path enumeration techniques, combining algebra, geometry, and linear programming.
Contribution
It presents a new algorithm with linear preprocessing and improved delay bounds for weighted enumeration, surpassing previous methods in efficiency and simplicity.
Findings
Achieves linear preprocessing and delay of O(|s|) with high probability.
Significantly improves delay bounds over previous algorithms.
Combines algebra, geometry, and linear programming techniques.
Abstract
Information extraction from textual data, where the query is represented by a finite transducer and the task is to enumerate all results without repetition, and its extension to the weighted case, where each output element has a weight and the output elements are to be enumerated sorted by their weights, are important and well-studied problems in database theory. On the one hand, the first framework already covers the well-known case of regular document spanners, while the latter setting covers several practically relevant tasks that cannot be described in the unweighted setting. It is known that in the unweighted case this problem can be solved with linear time preprocessing O(|D|) and output-linear delay O(|s|) in data complexity, where D is the input data and s is the current output element. For the weighted case, Bourhis, Grez, Jachiet, and Riveros [ICDT 2021] recently…
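To make the ranked-enumeration task concrete, the following is a minimal sketch of the general idea of enumerating paths of a weighted DAG in non-decreasing order of weight, which is the kind of shortest-path enumeration the summary above alludes to. It is not the paper's algorithm and carries none of its preprocessing or delay guarantees; the function name `ranked_paths`, the adjacency-list encoding, and the toy graph are all assumptions made for illustration.

```python
import heapq
from functools import lru_cache


def ranked_paths(graph, source, target):
    """Yield (weight, path) for every source-to-target path of a weighted DAG,
    in non-decreasing order of total weight.

    `graph` maps a node to a list of (successor, edge_weight) pairs.
    Illustrative baseline only; node labels must be mutually comparable
    because they take part in heap tie-breaking.
    """

    @lru_cache(maxsize=None)
    def best(node):
        # Cheapest cost of completing a path from `node` to `target`
        # (infinity if `target` is unreachable from `node`).
        if node == target:
            return 0
        return min(
            (w + best(nxt) for nxt, w in graph.get(node, [])),
            default=float("inf"),
        )

    if best(source) == float("inf"):
        return

    # Heap entries: (cheapest total weight of any completion, cost so far, path).
    heap = [(best(source), 0, (source,))]
    while heap:
        _, cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == target:
            yield cost, list(path)
            continue
        for nxt, w in graph.get(node, []):
            if best(nxt) != float("inf"):
                heapq.heappush(heap, (cost + w + best(nxt), cost + w, path + (nxt,)))


# Hypothetical toy instance: three s-to-t paths with weights 6, 3, and 5.
graph = {
    "s": [("a", 1), ("b", 4)],
    "a": [("t", 5), ("b", 1)],
    "b": [("t", 1)],
}
for weight, path in ranked_paths(graph, "s", "t"):
    print(weight, path)
```

Because each heap key is the exact weight of the cheapest completion of a partial path, complete paths leave the heap in sorted order. The cost per output in this sketch depends on the heap size, which is the kind of data-dependent overhead that improved delay bounds aim to avoid.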
Taxonomy
Topics: Data Management and Algorithms · Rough Sets and Fuzzy Logic · Advanced Database Systems and Queries
