Text vectorization via transformer-based language models and n-gram perplexities
Mihailo Škorić

TL;DR
This paper introduces a novel method for text representation that uses token-level n-gram perplexities to create vectors, addressing limitations of traditional scalar perplexity in capturing internal probability distributions.
Contribution
It proposes a new algorithm for computing token-level perplexity vectors, enhancing the analysis of text probability distributions beyond scalar measures.
Findings
Token-level perplexity vectors provide more detailed text representations.
The method captures internal probability distribution nuances.
Addresses limitations of traditional scalar perplexity.
Abstract
As the probability (and thus perplexity) of a text is calculated from the product of the probabilities of its individual tokens, one unlikely token can significantly reduce the probability (i.e., increase the perplexity) of an otherwise highly probable input, even though it may represent nothing more than a simple typographical error. Also, because perplexity is a scalar value that refers to the entire input, information about the probability distribution within the input is lost in the calculation, especially for longer texts: a relatively good text that has one unlikely token and another text in which each token is equally likely can have the same perplexity value. As an alternative to scalar perplexity, this research proposes a simple algorithm for calculating vector values based on n-gram perplexities within the input. Such representations consider the previously mentioned…
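The abstract's point about lost information can be illustrated with a small sketch. It is not the paper's algorithm, whose details are truncated above. It assumes per-token probabilities are already available from some language model, and it assumes the n-gram perplexities are taken over contiguous sliding windows of n tokens. The function names `perplexity` and `ngram_perplexities` and the probability values are hypothetical, chosen only for illustration.

```python
import math


def perplexity(token_probs):
    """Scalar perplexity: exp of the mean negative log-probability."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


def ngram_perplexities(token_probs, n):
    """Perplexity of each contiguous window of n tokens.

    Assumption: a stride-1 sliding window; the source does not
    specify how the n-grams are formed.
    """
    if not 1 <= n <= len(token_probs):
        raise ValueError("n must be between 1 and the number of tokens")
    return [
        perplexity(token_probs[i:i + n])
        for i in range(len(token_probs) - n + 1)
    ]


# Hypothetical per-token probabilities for two four-token inputs.
uniform = [0.5, 0.5, 0.5, 0.5]          # every token equally likely
one_outlier = [1.0, 1.0, 1.0, 0.0625]   # three certain tokens, one unlikely

print(perplexity(uniform), perplexity(one_outlier))
print(ngram_perplexities(uniform, 2))
print(ngram_perplexities(one_outlier, 2))
```

Both inputs have the same token-probability product (0.0625), so their scalar perplexities coincide at 2. With windows of two tokens, the first input gives a flat vector of about [2, 2, 2], while the second gives about [1, 1, 4], which localizes the single unlikely token at the end.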
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification
