C2T-ID: Converting Semantic Codebooks to Textual Document Identifiers for Generative Search
Yingchen Zhang, Ruqing Zhang, Jiafeng Guo, Wenjun Peng, Sen Li, Fuyu Lv, Xueqi Cheng

TL;DR
This paper introduces C2T-ID, a novel method for creating semantically rich document identifiers that balance expressiveness and search efficiency, improving generative retrieval performance.
Contribution
C2T-ID combines hierarchical clustering with keyword extraction to generate semantic, natural language-like docids that outperform existing methods in retrieval tasks.
Findings
C2T-ID outperforms baseline docid methods on Natural Questions.
It effectively balances semantic richness with manageable search spaces.
Experimental results show improved retrieval accuracy.
Abstract
Designing document identifiers (docids) that carry rich semantic information while maintaining tractable search spaces is an important challenge in generative retrieval (GR). Popular codebook methods address this by building a hierarchical semantic tree and constraining generation to its child nodes, yet their numeric identifiers cannot leverage the large language model's pretrained natural language understanding. Conversely, using text as docids provides more semantic expressivity but inflates the decoding space, making the system brittle to early-step errors. To resolve this trade-off, we propose C2T-ID: (i) first construct semantic numerical docids via hierarchical clustering; (ii) then extract high-frequency metadata keywords and iteratively replace each numeric label with its cluster's top-K keywords; and (iii) an optional two-level semantic smoothing step further enhances the fluency…
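Step (ii) of the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the numeric docids from step (i) already exist as root-to-leaf tuples of cluster labels, and that each document comes with a list of metadata keywords. The function name codes_to_text_ids, the " / " level separator, the alphabetical tie-breaking, and the rule that a keyword used at a shallower level is not reused deeper are all hypothetical choices made for illustration. The smoothing step (iii) is not covered.

```python
from collections import Counter, defaultdict


def codes_to_text_ids(numeric_ids, keywords, top_k=1):
    """Replace each numeric cluster label with its cluster's top-k keywords.

    numeric_ids: dict mapping doc id -> tuple of cluster labels (root to leaf).
    keywords:    dict mapping doc id -> list of metadata keywords.
    Both inputs are assumed to come from earlier pipeline stages.
    """
    # A tree node is identified by a code prefix; count how many documents
    # under each node carry each keyword.
    node_counts = defaultdict(Counter)
    for doc, code in numeric_ids.items():
        for depth in range(1, len(code) + 1):
            node_counts[code[:depth]].update(set(keywords[doc]))

    text_ids = {}
    for doc, code in numeric_ids.items():
        used = set()  # keywords already emitted at shallower levels
        parts = []
        for depth in range(1, len(code) + 1):
            counts = node_counts[code[:depth]]
            # Most frequent first; ties broken alphabetically (assumption).
            ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
            chosen = [w for w, _ in ranked if w not in used][:top_k]
            used.update(chosen)
            parts.append(" ".join(chosen))
        text_ids[doc] = " / ".join(parts)
    return text_ids


# Toy data: two top-level clusters with two leaves each.
numeric_ids = {"d1": (0, 0), "d2": (0, 1), "d3": (1, 0), "d4": (1, 1)}
keywords = {
    "d1": ["sports", "football", "league"],
    "d2": ["sports", "tennis", "wimbledon"],
    "d3": ["music", "jazz"],
    "d4": ["music", "opera"],
}
text_ids = codes_to_text_ids(numeric_ids, keywords, top_k=1)
print(text_ids["d1"])  # sports / football
print(text_ids["d4"])  # music / opera
```

Because the textual docid keeps the shape of the cluster tree, documents in the same cluster share a textual prefix, so decoding can still be constrained level by level. The sketch does not guarantee that two sibling nodes receive distinct keywords; a real system would need a collision-handling rule.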
Taxonomy
Topics: Information Retrieval and Search Behavior · Topic Modeling · Multimodal Machine Learning Applications
