Refining BERT Embeddings for Document Hashing via Mutual Information Maximization

Zijing Ou; Qinliang Su; Jianxing Yu; Ruihui Zhao; Yefeng Zheng; Bang Liu

arXiv:2109.02867·cs.IR·September 8, 2021

TL;DR

This paper introduces a novel unsupervised document hashing method that leverages BERT embeddings and mutual information maximization to produce more effective hash codes, outperforming traditional feature-based approaches.

Contribution

The paper proposes a new hashing paradigm based on mutual information maximization that effectively utilizes BERT embeddings for document hashing, overcoming limitations of generative models.

Findings

1. Significant improvement over BOW-based hashing methods.
2. Effective use of BERT embeddings in an unsupervised hashing framework.
3. Outperforms existing methods on benchmark datasets.

Abstract

Existing unsupervised document hashing methods are mostly built on generative models. Because long dependency structures are difficult to capture, these methods rarely model the raw documents directly, but instead model features extracted from them (e.g., bag-of-words (BOW), TFIDF). In this paper, we propose to learn hash codes from BERT embeddings after observing their tremendous success on downstream tasks. As a first try, we modify existing generative hashing models to accommodate the BERT embeddings. However, little improvement is observed over the codes learned from the old BOW or TFIDF features. We attribute this to the reconstruction requirement in generative hashing, which forces the irrelevant information that is abundant in the BERT embeddings to be compressed into the codes as well. To remedy this issue, a new unsupervised hashing paradigm is further proposed…

Code & Models

Repositories

j-zin/dhim (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization · Text and Document Classification Technologies

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Softmax · Attention Dropout · WordPiece · Layer Normalization · Dense Connections