PLM-GNN: A Webpage Classification Method based on Joint Pre-trained Language Model and Graph Neural Network

Qiwei Lang; Jingbo Zhou; Haoyi Wang; Shiqi Lyu; Rui Zhang

arXiv:2305.05378 · cs.CL · May 10, 2023 · 2 cites

TL;DR

PLM-GNN is a novel webpage classification approach that combines pre-trained language models and graph neural networks to jointly encode webpage text and HTML DOM structures, improving classification accuracy.

Contribution

The paper introduces PLM-GNN, a new method that jointly encodes webpage text and HTML DOM trees using pre-trained language models and GNNs, avoiding the manual feature engineering that classical methods require.

Findings

01. Performs well on the KI-04 and SWDE datasets

02. Effective on a practical scholar homepage crawling dataset

03. Outperforms traditional feature-based methods

Abstract

The number of web pages is growing at an exponential rate, accumulating massive amounts of data on the web. Classifying webpages is one of the key processes in web information mining. Some classical methods manually build features of web pages and train classifiers using machine learning or deep learning. However, building features manually requires specific domain knowledge, and validating the features usually takes a long time. Considering that webpages are generated by combining text and HTML Document Object Model (DOM) trees, we propose a representation and classification method based on a pre-trained language model and graph neural network, named PLM-GNN. It is based on the joint encoding of text and HTML DOM trees in the web pages. It performs well on the KI-04 and SWDE datasets and on the practical dataset AHS for the project of scholar's homepage…
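To make the idea of jointly encoding text and DOM structure concrete, here is a minimal Python sketch using only the standard library. It is not the authors' implementation: the abstract does not specify the language model or the GNN, so `toy_embed` (a hashed bag-of-words vector) stands in for a pre-trained language model, and `mean_aggregate` (one round of neighbor averaging) stands in for a graph neural network layer. All names here (`DomGraphBuilder`, `toy_embed`, `mean_aggregate`, `page_vector`, `DIM`) are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser
import hashlib

DIM = 8  # size of the toy feature vectors (an arbitrary choice for this sketch)


class DomGraphBuilder(HTMLParser):
    """Collects DOM elements as nodes and parent-child links as edges."""

    # A small subset of void elements that never receive a closing tag.
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.tags = []    # tag name of each node
        self.texts = []   # text found directly inside each node
        self.edges = []   # (parent_index, child_index) pairs
        self._stack = []  # indices of the currently open elements

    def handle_starttag(self, tag, attrs):
        idx = len(self.tags)
        self.tags.append(tag)
        self.texts.append("")
        if self._stack:
            self.edges.append((self._stack[-1], idx))
        if tag not in self.VOID:
            self._stack.append(idx)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element, if there is one.
        for pos in range(len(self._stack) - 1, -1, -1):
            if self.tags[self._stack[pos]] == tag:
                del self._stack[pos:]
                break

    def handle_data(self, data):
        if self._stack and data.strip():
            i = self._stack[-1]
            self.texts[i] = (self.texts[i] + " " + data.strip()).strip()


def toy_embed(text, dim=DIM):
    """Hashed bag-of-words vector; a placeholder for a PLM text embedding."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec


def mean_aggregate(features, edges):
    """One round of mean message passing over undirected edges with self-loops."""
    neighbors = [[i] for i in range(len(features))]
    for parent, child in edges:
        neighbors[parent].append(child)
        neighbors[child].append(parent)
    dim = len(features[0])
    return [
        [sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
        for nbrs in neighbors
    ]


def page_vector(html):
    """Builds the DOM graph, embeds each node, aggregates, and mean-pools."""
    builder = DomGraphBuilder()
    builder.feed(html)
    builder.close()
    feats = [toy_embed(t + " " + x) for t, x in zip(builder.tags, builder.texts)]
    if not feats:
        return [0.0] * DIM
    feats = mean_aggregate(feats, builder.edges)
    return [sum(f[d] for f in feats) / len(feats) for d in range(DIM)]
```

Calling `page_vector("<html><body><h1>Jane Doe</h1><p>Professor</p></body></html>")` returns a fixed-length vector that a downstream classifier could consume. In a realistic pipeline, the placeholder embedding and the averaging step would presumably be replaced by a trained language model and a learned GNN.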

Taxonomy

Topics: Web Data Mining and Analysis