TL;DR
Harvest is an open-source toolkit for accurately extracting forum posts and their metadata from diverse web forums. It handles non-standard page structures and improves extraction quality over previous systems.
Contribution
The paper introduces a novel method for precise XPath determination and metadata extraction, along with the Harvest toolkit that outperforms existing solutions.
Findings
Harvest achieves higher post-extraction accuracy than previous systems.
It effectively extracts author and thread metadata.
The toolkit is validated on 52 diverse forums.
Abstract
Automatic extraction of forum posts and metadata is a crucial but challenging task since forums do not expose their content in a standardized structure. Content extraction methods, therefore, often need customizations such as adaptations to page templates and improvements of their extraction code before they can be deployed to new forums. Most of the current solutions are also built for the more general case of content extraction from web pages and lack key features important for understanding forum content, such as the identification of author metadata and information on the thread structure. This paper, therefore, presents a method that determines the XPath of forum posts, eliminating incorrect mergers and splits of the extracted posts that were common in systems from the previous generation. Based on the individual posts, further metadata such as authors, forum URL and structure are…
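To make the idea of "determining the XPath of forum posts" concrete, here is a minimal sketch of one possible heuristic. It is not the algorithm from the paper or the Harvest code base. It assumes that posts are sibling elements sharing the same tag and class path, and that the repeated path carrying the most text is the post container. The sample page, the function names (`guess_post_xpath`, `extract_posts`) and the `author` class are all hypothetical, and the page must be well-formed markup because the sketch uses only the Python standard library.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, well-formed sample page (real forum HTML is rarely this clean).
SAMPLE_PAGE = """
<html><body>
  <div class="nav"><a href="/">Home</a><a href="/faq">FAQ</a></div>
  <div class="thread">
    <div class="post"><span class="author">alice</span><p>First post with a reasonably long body of text.</p></div>
    <div class="post"><span class="author">bob</span><p>A reply that also carries a fair amount of text.</p></div>
    <div class="post"><span class="author">carol</span><p>Another reply, again with some actual content.</p></div>
  </div>
</body></html>
"""


def step(el):
    """One path step: the tag name, plus a class predicate if a class is set."""
    cls = el.get("class")
    return f"{el.tag}[@class='{cls}']" if cls else el.tag


def candidate_paths(root):
    """Group all elements by their generalized path (no positional indices)."""
    groups = defaultdict(list)

    def walk(el, prefix):
        path = f"{prefix}/{step(el)}"
        groups[path].append(el)
        for child in el:
            walk(child, path)

    walk(root, "")
    return groups


def text_len(el):
    """Length of all text contained in the element and its descendants."""
    return len("".join(el.itertext()).strip())


def guess_post_xpath(root, min_repeats=2):
    """Return (path, elements) for the repeated path that carries the most text.

    Assumed heuristic: ties go to the path seen first, i.e. the shallower one.
    """
    best_path, best_elements, best_score = None, [], 0
    for path, elements in candidate_paths(root).items():
        if len(elements) < min_repeats:
            continue  # a post container should occur several times per page
        score = sum(text_len(e) for e in elements)
        if score > best_score:
            best_path, best_elements, best_score = path, elements, score
    return best_path, best_elements


def extract_posts(root):
    """Extract author and body text per post; one element yields one post."""
    _, elements = guess_post_xpath(root)
    posts = []
    for el in elements:
        author = el.findtext("span[@class='author']", default="")
        body = " ".join(p.text or "" for p in el.iter("p")).strip()
        posts.append({"author": author, "text": body})
    return posts


if __name__ == "__main__":
    page = ET.fromstring(SAMPLE_PAGE.strip())
    print(guess_post_xpath(page)[0])
    for post in extract_posts(page):
        print(post)
```

Because every post is taken from exactly one matched element, a correctly chosen path avoids merging neighbouring posts or splitting one post into fragments, which is the failure mode the abstract attributes to earlier systems. A real extractor would need a tolerant HTML parser and a much more robust scoring scheme than this text-length heuristic.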
