Putting Self-Supervised Token Embedding on the Tables
Marc Szafraniec, Gautier Marti, Philippe Donnat

TL;DR
This paper introduces SC2T, a self-supervised model that creates vector representations of tokens in semi-structured tables, improving information extraction from fuzzy or variably structured text data.
Contribution
The paper presents a novel self-supervised approach, SC2T, for embedding tokens in semi-structured messages, addressing limitations of traditional regex-based methods.
Findings
Effective token embeddings for semi-structured data
Unsupervised token labeling capability
Foundation for semi-supervised information extraction
Abstract
Electronic messages are a preferred means of distributing information for many businesses and individuals, often in the form of plain-text tables. As their number grows, it becomes necessary to extract text and numbers algorithmically rather than by hand. The usual methods rely on regular expressions or on a strict structure in the data, and are not effective when the data has many variations, a fuzzy structure, or implicit labels. In this paper we introduce SC2T, a fully self-supervised model that constructs vector representations of tokens in semi-structured messages using character-level and context-level information, which addresses these issues. It can then be used for unsupervised labeling of tokens, or serve as the basis for a semi-supervised information extraction system.
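To make the limitation of regular-expression methods concrete, the following minimal sketch applies one rigid pattern to a few plain-text table rows. The pattern, the field names (`name`, `bid`, `ask`), and the sample rows are all assumptions invented for illustration; none of them come from the paper, and this is not the SC2T model itself.

```python
import re

# Hypothetical rigid pattern: an upper-case name, then two decimal numbers,
# separated only by whitespace. Field names are illustrative assumptions.
ROW = re.compile(r"^(?P<name>[A-Z]+)\s+(?P<bid>\d+\.\d+)\s+(?P<ask>\d+\.\d+)$")


def parse_row(line):
    """Return a dict of fields, or None when the line does not fit the pattern."""
    match = ROW.match(line.strip())
    return match.groupdict() if match else None


# Invented example rows showing the kinds of variation the abstract mentions.
rows = [
    "ACME   101.25  101.75",    # fits the expected layout
    "ACME | 101.25 / 101.75",   # different separators
    "Acme Corp 101.25 101.75",  # mixed case, multi-word name
    "ACME 101,25 101,75",       # decimal comma instead of decimal point
]

parsed = [parse_row(row) for row in rows]

for row, fields in zip(rows, parsed):
    print(f"{row!r:30} -> {fields}")
```

Only the first row is parsed. Each variation would need its own pattern or special case, which is the maintenance burden that a learned token representation aims to avoid.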
