LAME: Layout Aware Metadata Extraction Approach for Research Articles
Jongyun Choi, Hyesoo Kong, Hwamook Yoon, Heung-Seon Oh, Yuchul Jung

TL;DR
This paper introduces LAME, a layout-aware framework for extracting metadata from research articles with diverse layouts. It combines automatic layout analysis, a large automatically constructed training set, and the Layout-MetaBERT model, and reaches 93.27% Macro-F1 on unseen journal layouts.
Contribution
The paper presents a novel layout-aware metadata extraction framework that combines automatic layout analysis, a large training dataset, and a specialized BERT model (Layout-MetaBERT) to handle diverse journal layouts.
Findings
Achieved 93.27% Macro-F1 in metadata extraction on unseen journal layouts.
Demonstrated robustness of Layout-MetaBERT across diverse layout formats.
Constructed a large, automatically extracted training dataset for metadata extraction.
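For readers unfamiliar with the metric in the first finding, Macro-F1 is the unweighted mean of the per-class F1 scores, so rare metadata fields count as much as frequent ones. The sketch below is a minimal pure-Python illustration; the function name `macro_f1` and the toy labels are assumptions made for this example and are not taken from the paper's evaluation code or data.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical, not from the paper).
y_true = ["title", "author", "author", "abstract", "abstract", "abstract"]
y_pred = ["title", "author", "abstract", "abstract", "abstract", "author"]
score = macro_f1(y_true, y_pred)
```

On this toy data the per-class F1 scores are 1 for title, 1/2 for author, and 2/3 for abstract, so `score` is 13/18, roughly 0.722.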
Abstract
The volume of academic literature, such as conference papers and journal articles, has increased rapidly worldwide, and research on metadata extraction is ongoing. However, high-performing metadata extraction remains challenging because layout formats differ across journal publishers. To accommodate the diversity of academic journal layouts, we propose a novel LAyout-aware Metadata Extraction (LAME) framework with three characteristics: the design of an automatic layout analysis, the construction of a large metadata training set, and the construction of Layout-MetaBERT. We designed the automatic layout analysis using PDFMiner. Based on the layout analysis, a large volume of metadata-separated training data, including the title, abstract, author name, author affiliated organization, and keywords, was automatically extracted. Moreover, we constructed…
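The abstract says the layout analysis is built on PDFMiner but this summary does not describe how. As a rough illustration only, the sketch below shows how pdfminer.six exposes text boxes with page coordinates, which is the kind of signal a layout-aware extractor can start from. The helper names `extract_text_blocks` and `reading_order` and the sorting rule are assumptions made for this example, not the authors' procedure.

```python
def extract_text_blocks(pdf_path, page_number=0):
    """Return (x0, y0, x1, y1, text) tuples for the text boxes on one page."""
    # Imported lazily so reading_order() below works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    blocks = []
    for page in extract_pages(pdf_path, page_numbers=[page_number]):
        for element in page:
            if isinstance(element, LTTextContainer):
                x0, y0, x1, y1 = element.bbox
                blocks.append((x0, y0, x1, y1, element.get_text().strip()))
    return blocks


def reading_order(blocks):
    """Sort blocks top-to-bottom, then left-to-right.

    PDF coordinates have their origin at the bottom-left corner,
    so a larger y1 means the block sits higher on the page.
    """
    return sorted(blocks, key=lambda b: (-b[3], b[0]))


# Hypothetical first-page blocks, standing in for extract_text_blocks("paper.pdf").
sample_blocks = [
    (50, 700, 300, 720, "Abstract"),
    (50, 760, 500, 790, "Title"),
    (50, 730, 200, 750, "Authors"),
]
ordered_texts = [b[4] for b in reading_order(sample_blocks)]
```

A simple top-to-bottom ordering like this breaks down on multi-column pages, which is one reason a dedicated layout analysis step is needed for journals with varied formats.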
Taxonomy
Topics: Web Data Mining and Analysis · Image Retrieval and Classification Techniques · Image Processing and 3D Reconstruction
