Recursive Abstractive Processing for Retrieval in Dynamic Datasets

Charbel Chucri; Rami Azouz; Joachim Ott

arXiv:2410.01736·cs.CL·October 3, 2024

TL;DR

This paper introduces a new algorithm for maintaining recursive-abstractive retrieval structures in dynamic datasets and a post-retrieval method to enhance context quality, improving retrieval performance in real-world scenarios.

Contribution

It presents a novel, efficient algorithm for updating hierarchical retrieval structures in dynamic datasets and a query-focused recursive abstractive post-processing technique.

Findings

01. Effective handling of dynamic datasets with hierarchical retrieval structures

02. Significant improvement in retrieval quality through recursive abstractive post-processing

03. Validated on real-world datasets, demonstrating enhanced performance

Abstract

Recent retrieval-augmented models enhance basic methods by building a hierarchical structure over retrieved text chunks through recursive embedding, clustering, and summarization. The most relevant information is then retrieved from both the original text and generated summaries. However, such approaches face limitations with dynamic datasets, where adding or removing documents over time complicates the updating of hierarchical representations formed through clustering. We propose a new algorithm to efficiently maintain the recursive-abstractive tree structure in dynamic datasets, without compromising performance. Additionally, we introduce a novel post-retrieval method that applies query-focused recursive abstractive processing to substantially improve context quality. Our method overcomes the limitations of other approaches by functioning as a black-box post-retrieval layer compatible…
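The recursive embed, cluster, and summarize loop described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: `embed` is a toy bag-of-words vector standing in for a neural encoder, `cluster` is a greedy threshold-based grouping standing in for the clustering step, `summarize` is a truncating placeholder standing in for an LLM summarizer, and the names `Node`, `build_tree`, and `retrieve` are hypothetical. The sketch covers only the static tree build and retrieval over both chunks and summaries; it does not implement the dynamic update algorithm or the post-retrieval method the paper proposes.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    level: int
    children: list = field(default_factory=list)


def embed(text):
    # Toy bag-of-words embedding (assumption: a real system uses a neural encoder).
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def summarize(texts):
    # Placeholder summarizer (assumption): keep the first 8 words of each child.
    return " ".join(" ".join(t.split()[:8]) for t in texts)


def cluster(nodes, threshold=0.2):
    # Greedy single-pass clustering (assumption): a node joins the cluster whose
    # first member is most similar, if that similarity reaches the threshold;
    # otherwise it opens a new cluster.
    clusters = []
    for node in nodes:
        vec = embed(node.text)
        best, best_sim = None, threshold
        for members in clusters:
            sim = cosine(vec, embed(members[0].text))
            if sim >= best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters.append([node])
        else:
            best.append(node)
    return clusters


def build_tree(chunks, max_levels=3):
    # Layer 0 holds the original chunks; each higher layer holds cluster summaries.
    layers = [[Node(c, 0) for c in chunks]]
    while len(layers[-1]) > 1 and len(layers) <= max_levels:
        groups = cluster(layers[-1])
        if len(groups) == len(layers[-1]):
            break  # nothing merged, so another layer would add no abstraction
        level = len(layers)
        layers.append([Node(summarize([n.text for n in g]), level, g) for g in groups])
    return layers


def retrieve(layers, query, k=2):
    # Rank original chunks and generated summaries together by similarity.
    q = embed(query)
    nodes = [n for layer in layers for n in layer]
    return sorted(nodes, key=lambda n: cosine(q, embed(n.text)), reverse=True)[:k]
```

Calling `build_tree` on a list of chunk strings returns the layers from leaves upward, and `retrieve(layers, query)` ranks leaves and summaries in one pool. Because every summary depends on its cluster's membership, adding or removing a chunk in this sketch would require rebuilding the affected layers, which is the maintenance cost the abstract attributes to dynamic datasets.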

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 4

Strengths

Document updating is a very common problem in document indexing, so it is useful to develop online indexing algorithms that can effectively address the dynamic dataset problem. Online clustering algorithms are a reasonable solution to this problem.

Weaknesses

My main concerns about this paper are as follows:
1. The contribution is incremental: the proposed method is an extension of previous recursive abstractive indexing algorithms.
2. The two contributions, adRAP and postQFRAP, are separate. adRAP addresses the dynamic dataset problem, and postQFRAP addresses the online organization of retrieved documents. The paper should focus on one main contribution.
3. The proposed adRAP algorithm is just an online clustering algorithm with tree structu

Reviewer 02 · Rating 3 · Confidence 3

Strengths

1. The method presented by the authors is mathematically interesting and seems sound (equation 3).
2. I like the pseudo-code written by the authors, which improves the clarity of the paper.

Weaknesses

1. I am not sure whether dynamic data for RAG is a high-impact research problem. First of all, building a tree for the entire dataset is a one-shot process, so I think the efficiency bottleneck is the inference speed. Secondly, if we just add a small number of documents, we can simply assign those documents to existing clusters (your way of adapting RAPTOR) or create a new cluster to hold these newly added documents. I guess the performance will not be affected significantly due to the small num

Reviewer 03 · Rating 3 · Confidence 3

Strengths

- adRAP addresses the most important limitation of the RAPTOR method. It is computationally efficient: as shown in Table 10, it is faster and requires fewer summarization calls than recomputing the entire tree with RAPTOR.
- Both models have performance close to RAPTOR.

Weaknesses

- Evaluation metrics: The authors use indirect metrics for evaluation. An LLM judge is not the most reliable evaluation protocol when direct comparisons are possible. Consider standard metrics such as accuracy (QuALITY) and F1 (QASPER) for better understanding.
- Baselines (adRAP): It would be useful to include two baselines: 1. do not update the RAPTOR tree and simply retrieve from the old RAPTOR tree and the new documents; 2. construct a new RAPTOR tree only on new documents a

Taxonomy

Topics: Neural Networks and Applications · Data Mining Algorithms and Applications