PharmaShip: An Entity-Centric, Reading-Order-Supervised Benchmark for Chinese Pharmaceutical Shipping Documents
Tingwei Xie, Tianyi Zhou, Yonghong Song

TL;DR
PharmaShip is a real-world Chinese dataset of scanned pharmaceutical shipping documents. Its benchmarks indicate that reading-order supervision and geometry-aware modeling improve information extraction under noisy OCR and heterogeneous templates.
Contribution
The paper presents PharmaShip, a new benchmark dataset and evaluation protocol for entity recognition, relation extraction, and reading order prediction in Chinese pharmaceutical documents, highlighting the benefits of combining pixel and geometry information.
Findings
Pixel features and explicit geometry provide complementary inductive biases; neither alone is sufficient.
Reading-order regularization improves robustness and accuracy.
Longer positional coverage stabilizes predictions and reduces artifacts.
Abstract
We present PharmaShip, a real-world Chinese dataset of scanned pharmaceutical shipping documents designed to stress-test pre-trained text-layout models under noisy OCR and heterogeneous templates. PharmaShip covers three complementary tasks, namely sequence entity recognition (SER), relation extraction (RE), and reading order prediction (ROP), and adopts an entity-centric evaluation protocol to minimize confounds across architectures. We benchmark five representative baselines spanning pixel-aware and geometry-aware families (LiLT, LayoutLMv3-base, GeoLayoutLM and their available RORE-enhanced variants), and standardize preprocessing, splits, and optimization. Experiments show that pixels and explicit geometry provide complementary inductive biases, yet neither alone is sufficient: injecting reading-order-oriented regularization consistently improves SER and EL and yields the most robust…
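The abstract does not spell out how the entity-centric protocol is scored. As a minimal sketch of one common reading, the Python below computes entity-level precision, recall, and F1 by exact match on (label, start, end) triples. The tuple format, the function name `entity_f1`, and the labels `DRUG`, `QTY`, and `DATE` are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Iterable, Tuple

# Hypothetical entity format: (label, start_token, end_token).
Entity = Tuple[str, int, int]


def entity_f1(gold: Iterable[Entity], pred: Iterable[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match entity scoring.

    A predicted entity counts as correct only if its label and both
    span boundaries match a gold entity exactly.
    """
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative labels only; the dataset's real label set may differ.
gold = [("DRUG", 0, 2), ("QTY", 3, 4), ("DATE", 5, 6)]
pred = [("DRUG", 0, 2), ("QTY", 3, 5)]  # second span has a boundary error
precision, recall, f1 = entity_f1(gold, pred)
```

Scoring whole entities rather than individual tokens means a single boundary error invalidates the entire entity, which makes results less sensitive to differences in tokenization across architectures.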
Taxonomy
Topics: Topic Modeling · Handwritten Text Recognition Techniques · Machine Learning in Healthcare
