CorpusLM: Towards a Unified Language Model on Corpus for Knowledge-Intensive Tasks

Xiaoxi Li; Zhicheng Dou; Yujia Zhou; Fangchao Liu

arXiv:2402.01176·cs.CL·April 23, 2024·1 cites

TL;DR

CorpusLM introduces a unified language model that integrates generative retrieval, closed-book generation, and retrieval-augmented generation to improve performance on knowledge-intensive tasks by effectively leveraging external corpora.

Contribution

The paper proposes CorpusLM, a novel unified model that combines generative retrieval and RAG with a new decoding process, enhancing knowledge-intensive task performance.

Findings

01. Outperforms existing models on the KILT benchmark

02. Improves retrieval quality through ranking-oriented DocID generation

03. Enhances downstream task accuracy with a unified decoding strategy

Abstract

Large language models (LLMs) have gained significant attention in various fields but are prone to hallucination, especially in knowledge-intensive (KI) tasks. To address this, retrieval-augmented generation (RAG) has emerged as a popular solution to enhance factual accuracy. However, traditional retrieval modules often rely on a large document index and are disconnected from generative tasks. With the advent of generative retrieval (GR), language models can retrieve by directly generating document identifiers (DocIDs), offering superior performance in retrieval tasks. However, the potential relationship between GR and downstream tasks remains unexplored. In this paper, we propose CorpusLM, a unified language model that leverages an external corpus to tackle various knowledge-intensive tasks by integrating generative retrieval, closed-book generation, and RAG through a unified greedy decoding…
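
To make the idea of retrieving by generating DocIDs concrete, here is a minimal sketch of greedy decoding constrained to valid identifiers. It is an illustration under stated assumptions, not the paper's implementation: the trie constraint is a common technique in generative retrieval that the abstract does not confirm CorpusLM uses, and `build_trie`, `constrained_greedy_decode`, `toy_score`, the `<eos>` marker, and the example DocIDs are all hypothetical. In particular, `toy_score` is a word-overlap stand-in for a real language model's next-token scores.

```python
from typing import Callable, List, Sequence

END = "<eos>"  # hypothetical end-of-DocID marker


def build_trie(docids: Sequence[Sequence[str]]) -> dict:
    """Build a prefix trie over the valid DocID token sequences."""
    root: dict = {}
    for tokens in docids:
        node = root
        for tok in list(tokens) + [END]:
            node = node.setdefault(tok, {})
    return root


def constrained_greedy_decode(
    query: str,
    trie: dict,
    score: Callable[[str, List[str], str], float],
) -> List[str]:
    """Greedily pick the best-scoring token among those the trie allows.

    Assumes the trie holds at least one DocID, so every step has a candidate.
    """
    node = trie
    prefix: List[str] = []
    while True:
        allowed = sorted(node.keys())  # sorted so ties break deterministically
        best = max(allowed, key=lambda tok: score(query, prefix, tok))
        if best == END:
            return prefix
        prefix.append(best)
        node = node[best]


def toy_score(query: str, prefix: List[str], token: str) -> float:
    """Stand-in for a language model: favor tokens that appear in the query."""
    if token == END:
        return 0.5
    return 1.0 if token in query.lower().split() else 0.0


# Hypothetical corpus of three documents with two-token DocIDs.
trie = build_trie([
    ["paris", "france"],
    ["paris", "texas"],
    ["berlin", "germany"],
])

print(constrained_greedy_decode("capital of france paris", trie, toy_score))
```

Because every step only considers tokens that extend a valid prefix, the decoder can never emit an identifier that is missing from the corpus, which is the property that lets a generated string be treated as a retrieval result.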

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling · Speech and dialogue systems

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Dense Connections · WordPiece · Dropout · Softmax · Attention Dropout