P-SIF: Document Embeddings Using Partition Averaging
Vivek Gupta, Ankit Saw, Pegah Nokhiz, Praneeth Netrapalli, Piyush Rai, Partha Talukdar

TL;DR
P-SIF is a document embedding method that partitions a document's words by topic, computes a weighted average of the word vectors within each topic, and concatenates the resulting topic-specific vectors. This improves representation quality over single-vector averaging, especially for long, multi-topic documents.
Contribution
The paper introduces P-SIF, a simple yet effective partitioned averaging approach that accounts for topical structure in long documents, enhancing embedding quality.
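The idea of partitioned averaging can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `partitioned_average`, its parameters, and the toy data are hypothetical, and the sketch assumes each word already comes with a soft topic-membership vector (how P-SIF obtains these is not described in this summary). Each topic gets its own weighted average of word vectors, and the per-topic averages are concatenated.

```python
from typing import Dict, List, Sequence


def partitioned_average(
    tokens: Sequence[str],
    word_vectors: Dict[str, List[float]],
    topic_weights: Dict[str, List[float]],
    word_weights: Dict[str, float],
) -> List[float]:
    """Return a document vector of length num_topics * dim.

    Assumption: topic_weights[w][k] is a given soft membership of
    word w in topic k, and word_weights[w] is an optional per-word
    weight (default 1.0), e.g. a frequency-based down-weighting.
    """
    known = [w for w in tokens if w in word_vectors and w in topic_weights]
    if not known:
        raise ValueError("no token has both a word vector and topic weights")

    dim = len(word_vectors[known[0]])
    num_topics = len(topic_weights[known[0]])
    doc = [0.0] * (num_topics * dim)

    for w in known:
        scale = word_weights.get(w, 1.0)
        for k in range(num_topics):
            coeff = scale * topic_weights[w][k]
            # Slot k of the output accumulates only topic k's share of w.
            for j in range(dim):
                doc[k * dim + j] += coeff * word_vectors[w][j]

    # Normalize by the number of tokens that contributed.
    return [x / len(known) for x in doc]


# Toy data (hypothetical): two "sports" words and one "politics" word.
word_vectors = {"goal": [1.0, 0.0], "match": [1.0, 1.0], "vote": [0.0, 1.0]}
topic_weights = {"goal": [1.0, 0.0], "match": [1.0, 0.0], "vote": [0.0, 1.0]}
doc = partitioned_average(["goal", "match", "vote"], word_vectors, topic_weights, {})
```

In this toy case the first two coordinates of `doc` hold the sports-topic average and the last two hold the politics-topic average, whereas a single plain average would blend all three words into one 2-dimensional vector.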
Findings
P-SIF outperforms simple averaging in document classification tasks.
Theoretical analysis supports the correctness of the partitioned approach.
Experimental results show significant improvements over baseline models.
Abstract
Simple weighted averaging of word vectors often yields effective representations for sentences which outperform sophisticated seq2seq neural models in many tasks. While it is desirable to use the same method to represent documents as well, unfortunately, the effectiveness is lost when representing long documents involving multiple sentences. One of the key reasons is that a longer document is likely to contain words from many different topics; hence, creating a single vector while ignoring all the topical structure is unlikely to yield an effective document representation. This problem is less acute in single sentences and other short text fragments where the presence of a single topic is most likely. To alleviate this problem, we present P-SIF, a partitioned word averaging model to represent long documents. P-SIF retains the simplicity of simple weighted word averaging while taking a…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence
