A Web Scale Entity Extraction System

Xuanting Cai; Quanbin Ma; Pan Li; Jianyu Liu; Qi Zeng; Zhengkan Yang; Pushkar Tripathi

arXiv:2110.00423·cs.CL·October 4, 2021

TL;DR

This paper presents a large-scale, multi-modal Transformer-based entity extraction system for diverse web content and shows empirically that multi-lingual, multi-task, and cross-document-type learning are effective.

Contribution

It reports lessons from applying multi-modal Transformers to web-scale entity extraction across multiple document types, along with label collection schemes that reduce noise in the collected data.

Findings

01. Multi-lingual, multi-task learning improves extraction accuracy.

02. Cross-document type learning enhances model generalization.

03. Effective label collection reduces data noise.

Abstract

Understanding the semantic meaning of content on the web through the lens of entities and concepts has many practical advantages. However, practitioners building large-scale entity extraction systems face unique challenges in finding the best ways to leverage the scale and variety of data available on internet platforms. We present learnings from our efforts to build an entity extraction system for multiple document types at large scale using multi-modal Transformers. We empirically demonstrate the effectiveness of multi-lingual, multi-task, and cross-document-type learning. We also discuss the label collection schemes that help minimize the amount of noise in the collected data.

Taxonomy

Topics: Topic Modeling · Web Data Mining and Analysis · Natural Language Processing Techniques