Multi-view Content-aware Indexing for Long Document Retrieval
Kuicai Dong, Derrick Goh Xin Deik, Yi Quan Lee, Hao Zhang, Xiangyang Li, Cong Zhang, Yong Liu

TL;DR
This paper introduces MC-indexing, a content-aware, multi-view indexing method for long document retrieval that improves recall significantly without requiring training, by considering document structure and multiple content representations.
Contribution
The paper proposes a novel, training-free multi-view indexing approach that incorporates document structure and multiple content views to enhance long document retrieval performance.
Findings
MC-indexing increases recall by up to 42.8% across various retrievers.
It requires no training or fine-tuning, enabling easy integration.
Experimental results show significant improvements over existing chunking schemes.
Abstract
Long document question answering (DocQA) aims to answer questions from long documents of over 10k words. Such documents usually contain content structures such as sections, sub-sections, and paragraph demarcations. However, indexing methods for long documents remain under-explored, and existing systems generally employ fixed-length chunking. Because fixed-length chunking ignores content structure, the resulting chunks can exclude vital information or include irrelevant content. Motivated by this, we propose Multi-view Content-aware indexing (MC-indexing) for more effective long DocQA via (i) segmenting structured documents into content chunks, and (ii) representing each content chunk in raw-text, keywords, and summary views. We highlight that MC-indexing requires neither training nor fine-tuning. Having plug-and-play capability, it can be seamlessly integrated with any retriever to boost its performance.…
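The two steps named in the abstract, structure-aware segmentation and multi-view representation, can be illustrated with a small Python sketch. This is not the authors' implementation. Everything in it is an assumption made for illustration: sections are detected by lines starting with "#", keywords are approximated by the most frequent non-stopword tokens, the summary is approximated by the first sentence of each section, the retriever is a toy Jaccard token-overlap scorer, and the per-view results are merged by a simple de-duplicated union. All function and variable names are hypothetical.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
             "for", "on", "with", "by", "it", "that", "this", "as"}


def tokenize(text):
    """Lowercase alphanumeric tokens with stopwords removed."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS]


@dataclass
class Chunk:
    title: str
    raw_text: str   # view 1: the section text itself
    keywords: str   # view 2: stand-in keyword list
    summary: str    # view 3: stand-in summary


def segment_by_headings(document):
    """Split on lines starting with '#'; each section becomes one chunk."""
    sections, title, body = [], "preamble", []
    for line in document.splitlines():
        if line.startswith("#"):
            if body:
                sections.append((title, " ".join(body)))
            title, body = line.lstrip("# ").strip(), []
        elif line.strip():
            body.append(line.strip())
    if body:
        sections.append((title, " ".join(body)))
    return sections


def build_index(document, num_keywords=5):
    """Create one Chunk per section, each with three views."""
    chunks = []
    for title, text in segment_by_headings(document):
        counts = Counter(tokenize(text))
        keywords = " ".join(w for w, _ in counts.most_common(num_keywords))
        summary = re.split(r"(?<=[.!?])\s+", text)[0]  # first sentence
        chunks.append(Chunk(title, text, keywords, summary))
    return chunks


def jaccard(query, view_text):
    """Toy retriever score: Jaccard overlap of token sets."""
    q, v = set(tokenize(query)), set(tokenize(view_text))
    return len(q & v) / len(q | v) if q | v else 0.0


def retrieve(query, chunks, top_k=1):
    """Rank chunks separately in each view, then merge without duplicates."""
    merged = []
    for view in ("raw_text", "keywords", "summary"):
        ranked = sorted(chunks,
                        key=lambda c: jaccard(query, getattr(c, view)),
                        reverse=True)
        for chunk in ranked[:top_k]:
            if chunk.title not in merged:
                merged.append(chunk.title)
    return merged


SAMPLE_DOC = """# Introduction
Long documents are hard to search. Fixed-length chunking splits text blindly.
# Method
We segment documents by section headings. Each chunk gets keywords and a summary.
# Results
Recall improves across retrievers. Gains are largest on long documents.
"""

chunks = build_index(SAMPLE_DOC)
print(retrieve("segment documents by section headings", chunks))
```

Because no component here is trained, the toy scorer could be swapped for any off-the-shelf sparse or dense retriever without changing the index, which mirrors the plug-and-play property the abstract describes.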
Taxonomy
Topics: Recommender Systems and Techniques · Web Data Mining and Analysis · Image Retrieval and Classification Techniques
