WYSIWYE: An Algebra for Expressing Spatial and Textual Rules for Visual Information Extraction
Vijil Chenthamarakshan, Prasad M Deshpande, Raghu Krishnapuram, Ramakrishna Varadarajan, Knut Stolze

TL;DR
This paper introduces WYSIWYE, an algebraic framework that allows declarative, layout-based rules for extracting information from web pages, improving robustness and enabling new rule types compared to traditional HTML source-based methods.
Contribution
The paper presents a novel algebraic framework for spatial and textual rule specification in web information extraction, integrating layout-level rules with traditional text-based approaches.
Findings
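Specifying rules at the layout level yields simpler extraction rules that are largely independent of the page's HTML source and are therefore more robust.
The framework can be implemented efficiently using relational database features.
Extraction is demonstrated to be effective on software requirement pages.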
Abstract
The visual layout of a webpage can provide valuable clues for certain types of Information Extraction (IE) tasks. In traditional rule-based IE frameworks, these layout cues are mapped to rules that operate on the HTML source of the webpages. In contrast, we have developed a framework in which the rules can be specified directly at the layout level. This has many advantages, since the higher level of abstraction leads to simpler extraction rules that are largely independent of the source code of the page, and, therefore, more robust. It can also enable specification of new types of rules that are not otherwise possible. To the best of our knowledge, there is no general framework that allows declarative specification of information extraction rules based on spatial layout. Our framework is complementary to traditional text-based rule frameworks and allows a seamless combination of spatial…
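To make the idea of a layout-level rule concrete, here is a minimal Python sketch. It is not the paper's algebra or implementation: the `Region` type, the `strictly_right_of` predicate, the `value_right_of` helper, and the sample coordinates are all hypothetical, chosen only to show how a rule of the form "take the value rendered to the right of a label" can be written against bounding boxes rather than against HTML tags.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Region:
    """A rendered text region with its bounding box (hypothetical type)."""
    text: str
    x: float  # left edge, in rendered-page pixels
    y: float  # top edge
    w: float  # width
    h: float  # height

    @property
    def right(self) -> float:
        return self.x + self.w

    @property
    def bottom(self) -> float:
        return self.y + self.h


def strictly_right_of(a: Region, b: Region) -> bool:
    """True if `a` starts at or beyond b's right edge and overlaps b vertically."""
    vertical_overlap = a.y < b.bottom and b.y < a.bottom
    return a.x >= b.right and vertical_overlap


def value_right_of(label: str, regions: List[Region]) -> Optional[str]:
    """Return the text of the nearest region to the right of the labelled region."""
    anchors = [r for r in regions if r.text == label]
    if not anchors:
        return None
    anchor = anchors[0]
    candidates = [
        r for r in regions if r is not anchor and strictly_right_of(r, anchor)
    ]
    if not candidates:
        return None
    # Nearest candidate by horizontal gap from the label.
    return min(candidates, key=lambda r: r.x - anchor.right).text


# Invented sample layout: two label/value rows, as on a requirements page.
regions = [
    Region("Operating system", 10, 100, 120, 20),
    Region("Windows XP", 150, 100, 90, 20),
    Region("Memory", 10, 130, 120, 20),
    Region("512 MB", 150, 130, 60, 20),
]

print(value_right_of("Operating system", regions))
print(value_right_of("Memory", regions))
```

Because the rule refers only to rendered positions, it would give the same answer whether the page builds its rows with a table, nested `div` elements, or CSS positioning, which is the kind of source independence the abstract describes.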
Taxonomy
Topics: Web Data Mining and Analysis · Caching and Content Delivery · Web Applications and Data Management
