Language Models As Semantic Indexers
Bowen Jin, Hansi Zeng, Guoyin Wang, Xiusi Chen, Tianxin Wei, Ruirui Li, Zhengyang Wang, Zheng Li, Yang Li, Hanqing Lu, Suhang Wang, Jiawei Han, Xianfeng Tang

TL;DR
This paper introduces LMIndexer, a self-supervised generative framework that learns high-quality semantic IDs for documents, improving retrieval tasks by addressing information loss and distribution mismatch issues in traditional methods.
Contribution
The paper presents LMIndexer, a novel self-supervised generative model that learns discrete semantic IDs with hierarchical structure and contrastive learning, outperforming existing approaches.
Findings
Effective semantic IDs for multiple retrieval tasks
Improved performance on recommendation, search, and retrieval datasets
High-quality IDs learned with self-supervised training
Abstract
Semantic identifier (ID) is an important concept in information retrieval that aims to preserve the semantics of objects such as documents and items inside their IDs. Previous studies typically adopt a two-stage pipeline to learn semantic IDs by first procuring embeddings using off-the-shelf text encoders and then deriving IDs based on the embeddings. However, each step introduces potential information loss, and there is usually an inherent mismatch between the distribution of embeddings within the latent space produced by text encoders and the anticipated distribution required for semantic indexing. It is non-trivial to design a method that can learn the document's semantic representations and its hierarchical structure simultaneously, given that semantic IDs are discrete and sequentially structured, and the semantic supervision is deficient. In this paper, we introduce LMIndexer, a…
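To make the two-stage pipeline described in the abstract concrete, the sketch below derives hierarchical semantic IDs from fixed document embeddings by recursive k-means. It illustrates the prior approach the abstract criticizes, not LMIndexer itself. Everything in it is an assumption for illustration: the abstract does not name a second-stage method, hierarchical k-means is just one plausible choice, the 2-D "embeddings" are made-up stand-ins for the output of an off-the-shelf text encoder, and `kmeans` and `semantic_ids` are hypothetical helper names.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise from k distinct input points
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each non-empty cluster's center to its mean.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def semantic_ids(embeddings, k=2, depth=2):
    """Stage 2 of the assumed pipeline: embeddings -> tuple-valued IDs.

    Each ID position is the cluster index chosen at that level of the
    hierarchy. Groups with fewer than k documents are not split further,
    so some IDs may be shorter than `depth`.
    """
    ids = {doc: [] for doc in embeddings}

    def split(docs, level):
        if level == depth or len(docs) < k:
            return
        labels = kmeans([embeddings[d] for d in docs], k)
        for c in range(k):
            group = [d for d, label in zip(docs, labels) if label == c]
            for d in group:
                ids[d].append(c)
            split(group, level + 1)

    split(list(embeddings), 0)
    return {doc: tuple(code) for doc, code in ids.items()}


# Stage 1 is faked here: made-up 2-D vectors instead of real encoder output.
toy = {
    "a1": [0.0, 0.0], "a2": [0.1, 0.0],
    "b1": [10.0, 10.0], "b2": [10.1, 10.0],
}
ids = semantic_ids(toy, k=2, depth=2)
print(ids)
```

In this toy run, documents whose embeddings are close share the first ID position, and the second position separates them. The sketch also shows where the problems named in the abstract can arise: the encoder is never told that its embeddings will be clustered, and each clustering level discards information that a jointly trained indexer could keep.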
Peer Reviews
Decision·ICML 2024 Poster
1. This paper presents the "LMINDEXER" approach as a solution to the challenges inherent in generating semantic IDs from textual data. The approach is carefully crafted to capture both the semantic representations and hierarchical structure of documents simultaneously.
2. The paper demonstrates the effectiveness of the LMINDEXER approach through empirical evidence gathered from experiments on three distinct downstream tasks, utilizing data from diverse domains.
3. The paper exhibits a well-org
1. While this paper presents an approach termed LMINDEXER, it's important to note that the novelty of the method is somewhat limited. Additionally, the paper lacks a comprehensive discussion of related work, including notable prior efforts that have explored the use of encoders for text encoding and decoders for reconstruction in the context of information retrieval. Several works, such as [1], [2], and [3], have examined similar techniques and deserve acknowledgment for their contributions to t
- The proposed method formulates the semantic ID learning problem as a sequence-to-sequence learning method, which is novel according to related work discussed in the paper.
- Technical challenges are described clearly.
- SOTA techniques are used in the proposed framework.
- The experimental results show that the proposed method outperforms some SOTA methods in the three downstream tasks.
- The paper says that the proposed method "learns the document’s discrete semantic embeddings and its hierarchical structure simultaneously". But it is not clear what the authors mean by the hierarchical structure of a document, how the proposed method is guaranteed to learn such a structure, and whether the proposed method actually learns such a structure.
- The size of the semantic ID (T) is set to less than or equal to 3 in the experiments, which is surprisingly small. Figure 5 shows the pe
- The paper is in general well-written and easy to follow.
- The approach is novel for learning semantic IDs in information retrieval, and the optimization challenges in such learning problems are highlighted.
My major concerns with the paper are the weak evaluation and baselines, and overall the training seems to need a lot of bells and whistles to succeed.
- DPR dual encoder is not a strong baseline; DPR is almost 10% behind SOTA dual-encoder approaches on standard benchmarks.
- Baselines in section 4.2 are weak since they are using an off-the-shelf text encoder and hence have no knowledge about the task; a very simple baseline that could be tried here is to train a dual-encoder model on this corpus
Taxonomy
Topics: Topic Modeling · Text and Document Classification Technologies · Natural Language Processing Techniques
