ASET: Ad-hoc Structured Exploration of Text Collections [Extended Abstract]
Benjamin Hättasch, Jan-Micha Bodensohn, Carsten Binnig

TL;DR
ASET is a system enabling flexible, ad-hoc structured exploration of text collections by extracting information nuggets and matching them to user-defined structured tables using embeddings, without pre-designed pipelines.
Contribution
The paper introduces ASET, a novel two-phase approach that allows ad-hoc structured data extraction from text collections using existing extractors and embedding-based matching.
Findings
High-quality extraction from real-world texts
No need for upfront pipeline design
Effective matching to user-defined structures
Abstract
In this paper, we propose a new system called ASET that allows users to perform structured explorations of text collections in an ad-hoc manner. The main idea of ASET is a new two-phase approach: it first extracts a superset of information nuggets from the texts using existing extractors such as named entity recognizers, and then uses embeddings to match the extractions to a structured table definition requested by the user. In our evaluation, we show that ASET is thus able to extract high-quality structured data from real-world text collections without the need to design extraction pipelines upfront.
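The two-phase idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the regex patterns are hypothetical stand-ins for real extractors such as named entity recognizers, the character-trigram vectors are a toy substitute for learned embeddings, and all function names (`extract_nuggets`, `embed`, `cosine`, `match`) are invented for illustration.

```python
import math
import re
from collections import Counter

# Phase 1 (assumption): toy regex "extractors" standing in for NER and similar tools.
EXTRACTORS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "money": re.compile(r"\$\d+(?:\.\d+)?"),
}


def extract_nuggets(text):
    """Collect a superset of (label, mention) nuggets from every extractor."""
    nuggets = []
    for label, pattern in EXTRACTORS.items():
        for m in pattern.finditer(text):
            nuggets.append((label, m.group()))
    return nuggets


def embed(text):
    """Toy embedding (assumption): bag of character trigrams of the padded text."""
    s = f"  {text.lower()} "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match(nuggets, attributes):
    """Phase 2: for each requested column, pick the nugget whose label is most similar."""
    row = {}
    for attr in attributes:
        attr_vec = embed(attr)
        best = max(nuggets, key=lambda n: cosine(embed(n[0]), attr_vec), default=None)
        row[attr] = best[1] if best else None
    return row


text = "On 2021-03-14 the airline paid $1200.50 in fees."
row = match(extract_nuggets(text), ["report date", "money paid"])
print(row)  # {'report date': '2021-03-14', 'money paid': '$1200.50'}
```

The point of the sketch is the separation of concerns: extraction runs once without knowing the target table, while the column names are only supplied at query time and resolved by similarity in a shared vector space.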
Taxonomy
Topics: Web Data Mining and Analysis · Data Quality and Management · Topic Modeling
