Robust Document Representations using Latent Topics and Metadata
Natraj Raman, Armineh Nourbakhsh, Sameena Shah, Manuela Veloso

TL;DR
This paper introduces a novel self-supervised method for generating document representations that incorporate both text and metadata, improving classification performance, especially when labeled data is limited.
Contribution
The authors propose a task-agnostic, self-supervised approach using latent topics and explicit metadata to create robust document embeddings for classification.
Findings
Outperforms several baselines on multiple datasets
Effective with small labeled datasets
Embeddings exhibit compositional characteristics
Abstract
Task-specific fine-tuning of a pre-trained neural language model using a custom softmax output layer is the de facto approach of late when dealing with document classification problems. This technique is not adequate when labeled examples are not available at training time and when the metadata artifacts in a document must be exploited. We address these challenges by generating document representations that capture both text and metadata artifacts in a task-agnostic manner. Instead of traditional auto-regressive or auto-encoding based training, our novel self-supervised approach learns a soft partition of the input space when generating text embeddings. Specifically, we employ a pre-learned topic model distribution as surrogate labels and construct a loss function based on KL divergence. Our solution also incorporates metadata explicitly rather than just augmenting them with text. The…
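The surrogate-label idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`softmax`, `kl_divergence`, `topic_kl_loss`), the direction of the divergence (target topic distribution against predicted distribution), and the averaging over a batch are all assumptions made for illustration. The sketch treats each document's pre-learned topic distribution as a soft target and measures how far a model's softmax output is from it.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of raw scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, predicted, eps=1e-12):
    # KL(target || predicted); terms with zero target mass contribute nothing.
    # The direction is an assumption; the abstract does not specify it.
    return sum(
        t * (math.log(t) - math.log(max(p, eps)))
        for t, p in zip(target, predicted)
        if t > 0.0
    )

def topic_kl_loss(topic_targets, batch_logits):
    # Mean KL divergence between topic-model targets (surrogate labels)
    # and the model's softmax outputs, one row per document.
    losses = [
        kl_divergence(t, softmax(z))
        for t, z in zip(topic_targets, batch_logits)
    ]
    return sum(losses) / len(losses)

# Hypothetical topic distributions for two documents over three topics.
targets = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
# Hypothetical model outputs (logits) for the same two documents.
logits = [[2.0, 0.5, 0.1], [0.0, 0.0, 0.0]]
loss = topic_kl_loss(targets, logits)
```

Because the targets are full distributions rather than one-hot labels, minimizing this loss pushes the model toward a soft partition of the input space, and no manually labeled examples are needed to compute it.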
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
Methods: Softmax
