Refining BERT Embeddings for Document Hashing via Mutual Information Maximization
Zijing Ou, Qinliang Su, Jianxing Yu, Ruihui Zhao, Yefeng Zheng, Bang Liu

TL;DR
This paper introduces a novel unsupervised document hashing method that leverages BERT embeddings and mutual information maximization to produce more effective hash codes, outperforming traditional feature-based approaches.
Contribution
The paper proposes a new hashing paradigm based on mutual information maximization that effectively utilizes BERT embeddings for document hashing, overcoming limitations of generative models.
Findings
Significant improvement over BOW-based hashing methods.
Effective use of BERT embeddings in an unsupervised hashing framework.
Outperforms existing methods on benchmark datasets.
Abstract
Existing unsupervised document hashing methods are mostly built on generative models. Because long-range dependency structures are difficult to capture, these methods rarely model the raw documents directly and instead model features extracted from them (e.g., bag-of-words (BOW), TFIDF). In this paper, we propose to learn hash codes from BERT embeddings, motivated by their tremendous success on downstream tasks. As a first attempt, we modify existing generative hashing models to accommodate the BERT embeddings. However, little improvement is observed over the codes learned from the old BOW or TFIDF features. We attribute this to the reconstruction requirement in generative hashing, which forces the irrelevant information that is abundant in the BERT embeddings to also be compressed into the codes. To remedy this issue, a new unsupervised hashing paradigm is further proposed…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization · Text and Document Classification Technologies
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Softmax · Attention Dropout · WordPiece · Layer Normalization · Dense Connections
