PCSI -- The Platform for Content-Structure Inference
Caleb Malchik, Joan Feigenbaum

TL;DR
PCSI is a platform for recording and sharing methods that convert web resources into structured content. The methods are scripts written in Hex, an Awk variant that can traverse the HTML DOM, and PCSI records also report the results of applying a given method to a given URL.
Contribution
PCSI introduces a system for encoding, sharing, and applying content-structure inference methods across web resources.
Findings
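Enables sharing of content-extraction methods as PCSI records.
Ties each method to a class of URLs, so different classes can use different scripts.
Records the result of applying a particular method to a particular URL, which supports reproducibility of content structuring.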
Abstract
The Platform for Content-Structure Inference (PCSI, pronounced "pixie") facilitates the sharing of information about the process of converting Web resources into structured content objects that conform to a predefined format. PCSI records encode methods for deriving structured content from classes of URLs, and report the results of applying particular methods to particular URLs. The methods are scripts written in Hex, a variant of Awk with facilities for traversing the HTML DOM.
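The two kinds of records described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not PCSI's actual record format: the class names `MethodRecord` and `ResultRecord`, their fields, the use of a regular expression to describe a class of URLs, and the example URLs are all hypothetical. The Hex script is treated as an opaque string because the abstract does not describe its syntax.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MethodRecord:
    """Hypothetical record: a method for one class of URLs."""
    method_id: str
    url_pattern: str  # assumption: a regex stands in for "class of URLs"
    script: str       # extraction script source (Hex in PCSI; opaque here)


@dataclass(frozen=True)
class ResultRecord:
    """Hypothetical record: the result of applying one method to one URL."""
    url: str
    method_id: str
    content: dict     # structured content produced by the method


def find_method(url: str, methods: list) -> Optional[MethodRecord]:
    """Return the first method whose URL pattern matches the whole URL."""
    for method in methods:
        if re.fullmatch(method.url_pattern, url):
            return method
    return None


# Illustrative data only; none of these values come from the paper.
methods = [
    MethodRecord(
        method_id="news-article-v1",
        url_pattern=r"https://news\.example\.org/\d{4}/.+",
        script="(Hex script source would go here)",
    )
]

url = "https://news.example.org/2024/some-story"
method = find_method(url, methods)

# A real system would run the script against the page's DOM; the content
# below is a hand-written placeholder, not the output of any script.
result = ResultRecord(
    url=url,
    method_id=method.method_id,
    content={"title": "Some Story"},
)
```

Keeping the method and the result in separate records mirrors the abstract's distinction between encoding a method for a class of URLs and reporting what that method produced for one particular URL.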
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Natural Language Processing Techniques · Library Science and Information Systems
