Pattern discovery for semi-structured web pages using bar-tree representation
Z. Akbar, L.T. Handoko

TL;DR
This paper introduces the bar-tree representation for pattern discovery in semi-structured web pages, enabling efficient recognition of template changes and improving data extraction accuracy.
Contribution
The novel bar-tree representation and reverse algorithm improve pattern recognition and change detection in semi-structured web pages compared to previous methods.
Findings
High recognition rate for template changes
Efficient pattern description using bar graphs
Effective detection of pattern modifications
Abstract
Many websites with an underlying database containing structured data provide the richest and densest source of information relevant to topical data integration. Real data integration requires sustainable and reliable pattern discovery to enable accurate content retrieval and to recognize pattern changes over time; yet the extraction of structured data from web documents still lacks accuracy. This paper proposes the bar-tree representation to describe the whole pattern of web pages efficiently, based on the reverse algorithm. While previous algorithms always trace the pattern and extract the region of interest from the top root, the reverse algorithm recognizes the pattern from the region of interest to both the top and bottom roots simultaneously. The attributes are then extracted and labeled in reverse, starting from the region of interest of the targeted contents.…
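The idea of starting from the region of interest and tracing outward can be illustrated with a minimal Python sketch. This is not the authors' reverse algorithm or their bar-tree representation, whose details are not given in the abstract; it only shows, under simplifying assumptions, how a tag path can be walked from a target node up to the document root and down to the leaves. The names `Node`, `TreeBuilder`, `find_roi`, `path_to_top`, and `paths_to_bottom` are hypothetical, and the sample markup is invented for illustration.

```python
from html.parser import HTMLParser


class Node:
    """A bare-bones tree node with a parent pointer (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple tree; assumes well-formed markup with matching close tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def find_roi(node, needle):
    """Return the deepest node whose own text contains `needle`, or None."""
    for child in node.children:
        hit = find_roi(child, needle)
        if hit is not None:
            return hit
    return node if needle in node.text else None


def path_to_top(node):
    """Tag path from the given node upward to the document root."""
    path = []
    while node is not None:
        path.append(node.tag)
        node = node.parent
    return path


def paths_to_bottom(node):
    """Tag paths from the given node downward to each leaf beneath it."""
    if not node.children:
        return [[node.tag]]
    return [[node.tag] + tail
            for child in node.children
            for tail in paths_to_bottom(child)]


# Invented sample page: one table row holding a label and a value.
SAMPLE = ("<html><body><table><tr><td><b>Price</b></td>"
          "<td>42</td></tr></table></body></html>")

builder = TreeBuilder()
builder.feed(SAMPLE)
builder.close()

roi = find_roi(builder.root, "Price")   # node holding the target content
row = roi.parent.parent                 # enclosing <tr>, treated as the region
upward = path_to_top(roi)
downward = paths_to_bottom(row)
print(upward)
print(downward)
```

Comparing the upward and downward tag paths collected on two visits to the same page is one simple way a template change could be flagged, though the paper's own change-detection method may differ.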
