Putting Self-Supervised Token Embedding on the Tables

Marc Szafraniec; Gautier Marti; Philippe Donnat

arXiv:1708.04120 · cs.IR · October 26, 2017

TL;DR

This paper introduces SC2T, a self-supervised model that builds vector representations of tokens in semi-structured plain-text tables, intended to support information extraction from messages whose structure is fuzzy or varies from one sender to another.

Contribution

The paper presents a novel self-supervised approach, SC2T, for embedding tokens in semi-structured messages, addressing limitations of traditional regex-based methods.
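To make the limitation of regex-based extraction concrete, here is a minimal Python sketch. The pattern and the four message lines are hypothetical, invented for illustration and not taken from the paper. A pattern written for one layout silently misses the same information as soon as the label or number format changes.

```python
import re

# Hypothetical pattern written for one sender's layout: "Price: 101.50"
PRICE_RE = re.compile(r"^Price:\s+(\d+\.\d+)$")

# Hypothetical lines carrying the same information in different layouts.
lines = [
    "Price: 101.50",
    "Px 101.50",
    "price : 101,50",
    "101.50 (price)",
]


def extract_price(line):
    """Return the captured price string, or None if the pattern misses."""
    match = PRICE_RE.match(line)
    return match.group(1) if match else None


results = [extract_price(line) for line in lines]
print(results)  # only the first layout is matched
```

Each new variation needs another hand-written pattern, which is the maintenance burden a learned token representation is meant to avoid.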

Findings

01 Effective token embeddings for semi-structured data

02 Unsupervised token labeling capability

03 Foundation for semi-supervised information extraction

Abstract

Electronic messages are a favored way for many businesses and individuals to distribute information, often in the form of plain-text tables. As the number of messages grows, it becomes necessary to extract text and numbers by algorithm rather than by hand. The usual methods rely on regular expressions or on strictly structured data, and they cope poorly with many variations, fuzzy structure, or implicit labels. In this paper we introduce SC2T, a fully self-supervised model that addresses these issues by constructing vector representations of tokens in semi-structured messages from both character-level and context-level information. These representations can then be used for unsupervised labeling of tokens, or serve as the basis for a semi-supervised information extraction system.
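The abstract does not spell out how the character and context levels are built, so the following Python sketch is only an illustration of the kind of inputs such a model could consume, not SC2T itself. The helper names (`tokenize`, `context_of`), the sample message, and the definition of context as nearest horizontal neighbors plus column-overlapping tokens on adjacent lines are all assumptions made for this example.

```python
import re


def tokenize(message):
    """Split a plain-text table into tokens with their row and column span."""
    tokens = []
    for row, line in enumerate(message.splitlines()):
        for m in re.finditer(r"\S+", line):
            tokens.append({"text": m.group(), "row": row,
                           "start": m.start(), "end": m.end()})
    return tokens


def context_of(token, tokens):
    """Gather character-level and context-level inputs for one token.

    Assumed context definition: nearest token on each side in the same
    line, plus tokens on adjacent lines whose columns overlap this token.
    """
    same_row = [t for t in tokens
                if t["row"] == token["row"] and t is not token]
    left = [t["text"] for t in same_row if t["end"] <= token["start"]]
    right = [t["text"] for t in same_row if t["start"] >= token["end"]]
    vertical = [t["text"] for t in tokens
                if abs(t["row"] - token["row"]) == 1
                and t["start"] < token["end"]
                and token["start"] < t["end"]]
    return {"chars": list(token["text"]),
            "left": left[-1:], "right": right[:1], "vertical": vertical}


# Hypothetical message, aligned with spaces as in a plain-text table.
message = "\n".join([
    "Name   Bid    Ask",
    "ACME   101.5  102.0",
    "FOO    99.0   99.5",
])

tokens = tokenize(message)
target = next(t for t in tokens if t["text"] == "101.5")
features = context_of(target, tokens)
print(features)
```

In this sketch the token `101.5` is described both by its characters (digits and a decimal point) and by its surroundings (`Bid` above it, `ACME` to its left), which is the intuition behind labeling tokens without explicit headers or strict layout.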
