DocTag2Vec: An Embedding Based Multi-label Learning Approach for Document Tagging
Sheng Chen, Akshay Soni, Aasish Pappu, Yashar Mehdad

TL;DR
DocTag2Vec is a novel embedding-based approach that jointly learns representations of words, documents, and tags to improve multi-label document tagging directly from raw text, outperforming existing methods.
Contribution
It extends Word2Vec and Doc2Vec models to jointly embed words, documents, and tags, enabling effective multi-label tagging and handling new tags without feature engineering.
Findings
Outperforms state-of-the-art methods on multiple datasets
Learns meaningful tag and document representations
Handles new tags dynamically
Abstract
In this work, document tagging refers to tagging news articles or blog posts with relevant tags from a predefined collection. Accurate tagging of articles can benefit several downstream applications such as recommendation and search. We propose a novel yet simple approach called DocTag2Vec to accomplish this task. We substantially extend Word2Vec and Doc2Vec, two popular models for learning distributed representations of words and documents. In DocTag2Vec, we simultaneously learn the representations of words, documents, and tags in a joint vector space during training, and employ a simple k-nearest neighbor search to predict tags for unseen documents. In contrast to previous multi-label learning methods, DocTag2Vec deals directly with raw text instead of a provided feature vector and, in addition, enjoys advantages like the learning of tag representation, and…
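The prediction step described in the abstract, a k-nearest neighbor search over tags in the joint vector space, can be illustrated with a minimal sketch. It assumes that document and tag embeddings have already been learned and that cosine similarity is the closeness measure. The abstract does not specify the metric or how an unseen document is embedded, so the function names (`cosine`, `predict_tags`) and the toy vectors are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_tags(doc_vec, tag_vecs, k=2):
    """Return the k tags whose embeddings are most similar to doc_vec.

    Assumption: cosine similarity is used as the nearness measure; the
    source abstract only says "k-nearest neighbor search".
    """
    ranked = sorted(
        tag_vecs,
        key=lambda tag: cosine(doc_vec, tag_vecs[tag]),
        reverse=True,
    )
    return ranked[:k]


# Toy, hand-made embeddings (hypothetical values, not results from the paper).
tag_vecs = {
    "sports": [1.0, 0.0, 0.0],
    "politics": [0.0, 1.0, 0.0],
    "finance": [0.0, 0.0, 1.0],
}
doc_vec = [0.9, 0.1, 0.4]

print(predict_tags(doc_vec, tag_vecs, k=2))  # ['sports', 'finance']
```

Because tags live in the same space as documents, adding a new tag in this sketch only requires adding one more entry to `tag_vecs`; how the paper actually learns that vector is not covered by the abstract.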
Taxonomy
Topics: Text and Document Classification Technologies · Topic Modeling · Natural Language Processing Techniques
