Textual Data Distributions: Kullback Leibler Textual Distributions Contrasts on GPT-2 Generated Texts, with Supervised, Unsupervised Learning on Vaccine & Market Topics & Sentiment
Jim Samuel, Ratnakar Palle, Eduardo Correa Soares

TL;DR
This paper introduces KL-TDC, a method for analyzing and validating the alignment of textual data distributions, especially in generated texts. It combines supervised and unsupervised learning with GPT-2 text generation to improve NLP applications that depend on topic and sentiment matching.
Contribution
It presents a new approach combining machine learning and a modified Kullback-Leibler divergence to assess the similarity between natural and machine-generated textual data distributions.
Findings
KL-TDC effectively measures the alignment between generated and natural texts.
GPT-2 can produce topic-aligned and sentiment-aligned texts.
The method helps address data sparsity in NLP applications.
Abstract
Efficient alignment and generation of textual data distributions (TDD) are open research problems in textual analytics and NLP. It is presently difficult to parsimoniously and methodologically confirm that two or more natural language datasets belong to similar distributions, and to identify the extent to which textual data possess alignment. This study addresses a segment of this broader problem by applying multiple supervised and unsupervised machine learning (ML) methods to explore the behavior of TDD by (i) topical alignment and (ii) sentiment alignment. Furthermore, we use multiple text generation methods, including fine-tuned GPT-2, to generate text by topic and by sentiment. Finally, we develop a unique process-driven variation of Kullback-Leibler divergence (KLD) application to TDD, named KL Textual Distributions Contrasts (KL-TDC), to identify the…
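To make the idea of contrasting two textual distributions concrete, here is a minimal sketch of the standard Kullback-Leibler divergence computed over smoothed unigram distributions of two small corpora. It is not the authors' KL-TDC procedure, whose process-driven details are not given in the truncated abstract above. The function names (`word_distribution`, `kl_divergence`, `contrast`), the whitespace tokenizer, the additive smoothing constant `alpha`, and the toy sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_distribution(texts, vocab, alpha=1.0):
    """Additively smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """Standard KL(P || Q) in nats; p and q must share keys, with q > 0."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def contrast(texts_a, texts_b, alpha=1.0):
    """KL divergence from corpus A's word distribution to corpus B's."""
    # Build one vocabulary from both corpora so the supports match.
    vocab = sorted({tok for t in texts_a + texts_b for tok in t.lower().split()})
    p = word_distribution(texts_a, vocab, alpha)
    q = word_distribution(texts_b, vocab, alpha)
    return kl_divergence(p, q)


# Toy corpora (hypothetical, not taken from the paper's data).
natural = ["vaccine rollout starts today", "markets rally on vaccine news"]
generated = ["vaccine news lifts markets", "vaccine rollout begins today"]
unrelated = ["the cat sat on the mat", "dogs chase the ball"]

print("natural vs generated:", contrast(natural, generated))
print("natural vs unrelated:", contrast(natural, unrelated))
```

In this sketch a lower value indicates that the two corpora use words with more similar frequencies. Smoothing keeps every probability strictly positive, so the logarithm is always defined even when a word appears in only one corpus.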
Taxonomy
Topics: Misinformation and Its Impacts · Topic Modeling
Methods: Linear Layer · Cosine Annealing · Residual Connection · Linear Warmup With Cosine Annealing · Attention Dropout · Softmax · Dense Connections · Attention Is All You Need · Adam
