Robust Document Representations using Latent Topics and Metadata

Natraj Raman; Armineh Nourbakhsh; Sameena Shah; Manuela Veloso

arXiv:2010.12681·cs.CL·October 27, 2020

Open Access

TL;DR

This paper introduces a novel self-supervised method for generating document representations that incorporate text and metadata, improving classification performance especially with limited labeled data.

Contribution

The authors propose a task-agnostic, self-supervised approach using latent topics and explicit metadata to create robust document embeddings for classification.

Findings

01 Outperforms several baselines on multiple datasets

02 Effective with small labeled datasets

03 Embeddings exhibit compositional characteristics

Abstract

Task-specific fine-tuning of a pre-trained neural language model using a custom softmax output layer is the de facto approach of late when dealing with document classification problems. This technique is not adequate when labeled examples are not available at training time and when the metadata artifacts in a document must be exploited. We address these challenges by generating document representations that capture both text and metadata artifacts in a task-agnostic manner. Instead of traditional auto-regressive or auto-encoding based training, our novel self-supervised approach learns a soft-partition of the input space when generating text embeddings. Specifically, we employ a pre-learned topic model distribution as surrogate labels and construct a loss function based on KL divergence. Our solution also incorporates metadata explicitly rather than just augmenting them with text. The…
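
The surrogate-label idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy numbers, and the direction of the divergence (topic distribution as the reference, encoder output as the approximation) are all assumptions made for illustration. It only shows how a per-document topic distribution could serve as a soft target for a KL-divergence loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same topics."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def topic_kl_loss(topic_targets, encoder_logits):
    """Mean KL between surrogate topic labels and predicted distributions.

    topic_targets: per-document topic proportions from a pre-learned
        topic model (assumed to be given; each row sums to 1).
    encoder_logits: hypothetical unnormalized scores produced by a text
        encoder, one score per topic.
    """
    losses = [
        kl_divergence(p, softmax(z))
        for p, z in zip(topic_targets, encoder_logits)
    ]
    return sum(losses) / len(losses)

# Hypothetical toy batch: 2 documents, 3 topics.
topic_targets = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
encoder_logits = [[2.0, 0.5, 0.0], [0.0, 0.0, 2.0]]
loss = topic_kl_loss(topic_targets, encoder_logits)
print(f"mean KL loss: {loss:.4f}")
```

Because the targets are full distributions rather than one-hot labels, minimizing this loss pushes the encoder toward a soft assignment of each document across topics, which is one way to read the "soft-partition" described above.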

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies

Methods: Softmax