AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing
Huawei Ji, Cheng Deng, Bo Xue, Zhouyang Jin, Jiaxin Ding, Xiaoying Gan, Luoyi Fu, Xinbing Wang, and Chenghu Zhou

TL;DR
AceParse is a new comprehensive dataset that enables accurate parsing of diverse structured texts in academic literature, supporting improved data quality in AI applications.
Contribution
The paper introduces AceParse, the first dataset covering various structured texts in academic literature, and fine-tunes a multimodal model that surpasses previous methods in parsing accuracy.
Findings
AceParser outperforms the previous state of the art by 4.1% in F1 score.
AceParser improves Jaccard Similarity by 5% over the previous state of the art.
AceParse enables better parsing of formulas, tables, and embedded mathematical expressions.
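For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how F1 and Jaccard Similarity can be computed between a parsed output and a reference text. It assumes whitespace tokenization, a multiset (bag-of-tokens) F1, and a set-based Jaccard; the function names are hypothetical, and this summary does not state the exact tokenization or metric definitions the authors used, so their numbers may be computed differently.

```python
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization; the paper's tokenizer is not given here.
    return text.split()


def jaccard_similarity(pred, ref):
    """Set-based Jaccard: |A ∩ B| / |A ∪ B| over unique tokens."""
    a, b = set(tokenize(pred)), set(tokenize(ref))
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


def token_f1(pred, ref):
    """Bag-of-tokens F1: harmonic mean of token precision and recall."""
    p, r = Counter(tokenize(pred)), Counter(tokenize(ref))
    overlap = sum((p & r).values())  # tokens shared, counting multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predicted = r"\frac{a}{b} + c = d"
    reference = r"\frac{a}{b} + c = e"
    print(token_f1(predicted, reference))
    print(jaccard_similarity(predicted, reference))
```

Both scores lie in [0, 1]. Because Jaccard divides by the union of tokens, it penalizes a mismatch more heavily than F1 does on the same pair of texts.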
Abstract
With the development of data-centric AI, the focus has shifted from model-driven approaches to improving data quality. Academic literature, as one of the crucial types, is predominantly stored in PDF formats and needs to be parsed into texts before further processing. However, parsing diverse structured texts in academic literature remains challenging due to the lack of datasets that cover various text structures. In this paper, we introduce AceParse, the first comprehensive dataset designed to support the parsing of a wide range of structured texts, including formulas, tables, lists, algorithms, and sentences with embedded mathematical expressions. Based on AceParse, we fine-tuned a multimodal model, named AceParser, which accurately parses various structured texts within academic literature. This model outperforms the previous state-of-the-art by 4.1% in terms of F1 score and by 5% in…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
