Text vectorization via transformer-based language models and n-gram perplexities

Mihailo Škorić

arXiv:2307.09255 · cs.CL · July 19, 2023

TL;DR

This paper introduces a novel method for text representation that uses token-level n-gram perplexities to create vectors, addressing limitations of traditional scalar perplexity in capturing internal probability distributions.

Contribution

It proposes a new algorithm for computing token-level perplexity vectors, enhancing the analysis of text probability distributions beyond scalar measures.

Findings

1. Token-level perplexity vectors provide more detailed text representations than a single scalar perplexity.

2. The method captures how probability is distributed within the input.

3. The method addresses limitations of traditional scalar perplexity.

Abstract

Because the probability (and thus perplexity) of a text is calculated from the product of the probabilities of its individual tokens, a single unlikely token can significantly reduce the probability (i.e., increase the perplexity) of an otherwise highly probable input, even when that token is merely a typographical error. Moreover, since perplexity is a scalar value that refers to the entire input, information about the probability distribution within the input is lost in the calculation, especially for longer texts: a relatively good text containing one unlikely token and a text in which every token is equally likely can have the same perplexity value. As an alternative to scalar perplexity, this research proposes a simple algorithm for calculating vector values based on n-gram perplexities within the input. Such representations consider the previously mentioned…
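The information loss described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm, whose details are not given here. It assumes perplexity is the exponential of the mean negative log-probability, uses made-up per-token probabilities in place of a real language model, and slides a window of `n` tokens with stride 1. The names `perplexity` and `ngram_perplexities` are hypothetical.

```python
import math


def perplexity(probs):
    """Perplexity of a token sequence: exp of the mean negative log-probability."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


def ngram_perplexities(probs, n):
    """Perplexity of every contiguous window of n tokens (stride 1).

    Assumption: a plain sliding window, used only for illustration.
    """
    if not 1 <= n <= len(probs):
        raise ValueError("n must be between 1 and the number of tokens")
    return [perplexity(probs[i:i + n]) for i in range(len(probs) - n + 1)]


# Made-up per-token probabilities (not produced by any real model).
uniform = [0.1, 0.1, 0.1, 0.1]     # every token equally likely
typo = [0.5, 0.5, 0.5, 0.0008]     # good text with one very unlikely token

# Both products equal 1e-4, so the scalar perplexities coincide (about 10).
print(perplexity(uniform), perplexity(typo))

# The windowed values differ: flat for `uniform`, a spike at the end for `typo`.
print(ngram_perplexities(uniform, 2))
print(ngram_perplexities(typo, 2))
```

With these numbers the two inputs are indistinguishable by scalar perplexity, while the bigram windows for `typo` stay low until the last window, which contains the unlikely token.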

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification